Abstract

In this article, we develop and investigate a new classifier based on features extracted using spatial depth. Our construction is based on fitting a generalized additive model to the posterior probabilities of the different competing classes. To cope with possible multi-modal as well as non-elliptic population distributions, we develop a localized version of spatial depth and use that with varying degrees of localization to build the classifier. Final classification is done by aggregating several posterior probability estimates, each of which is obtained using localized spatial depth with a fixed scale of localization. The proposed classifier can be conveniently used even when the dimension is larger than the sample size, and its good discriminatory power for such data has been established using theoretical as well as numerical results.

Keywords: Bayes classifier, elliptic and non-elliptic distributions, HDLSS asymptotics, uniform strong consistency, weighted aggregation of posteriors.

1 Introduction


In a classification problem with $J$ classes, we usually have $n_j$ labeled observations $\mathbf{x}_{j1}, \ldots, \mathbf{x}_{jn_j}$ from the $j$-th class ($1 \le j \le J$), and we use these $n = \sum_{j=1}^{J} n_j$ observations to construct a decision rule for classifying a new unlabeled observation $\mathbf{x}$ to one of these $J$ pre-defined classes. If $\pi_j$ and $f_j$ respectively denote the prior probability and the probability density function of the $j$-th class, and $p(j|\mathbf{x})$ denotes the corresponding posterior probability, the optimal Bayes classifier assigns $\mathbf{x}$ to the class $j^*$, where $j^* = \arg\max_{1 \le j \le J} p(j|\mathbf{x}) = \arg\max_{1 \le j \le J} \pi_j f_j(\mathbf{x})$. However, the $f_j(\mathbf{x})$'s (or, the $p(j|\mathbf{x})$'s) are unknown in practice, and one needs to estimate them from the training sample of labeled observations. Popular parametric classifiers like linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are motivated by parametric model assumptions on the underlying class distributions. So, they may lead to poor classification when the model assumptions fail to hold, and the class boundaries of the Bayes classifier have complex geometry. On the other hand, nonparametric classifiers like those based on $k$-nearest neighbors ($k$-NN) and kernel density estimates (KDE) are more flexible and free from such model assumptions. But they suffer from the curse of dimensionality and are often not suitable for high-dimensional data.

Consider two examples denoted by E1 and E2, respectively. E1 involves a classification problem with two classes in $\mathbb{R}^d$, where the distribution of the first class is an equal mixture of $N_d(\mathbf{0}_d, \mathbf{I}_d)$ and $N_d(\mathbf{0}_d, 10\mathbf{I}_d)$, and that for the second class is $N_d(\mathbf{0}_d, 5\mathbf{I}_d)$. Here $N_d$ denotes the $d$-variate normal distribution, $\mathbf{0}_d = (0, \ldots, 0)^T \in \mathbb{R}^d$ and $\mathbf{I}_d$ is the $d \times d$ identity matrix. In E2, each class distribution is an equal mixture of two uniform distributions. The distribution for the first (respectively, the second) class is a mixture of $U_d(0, 1)$ and $U_d(2, 3)$ (respectively, $U_d(1, 2)$ and $U_d(3, 4)$). Here $U_d(r_1, r_2)$ denotes the uniform distribution over the region $\{\mathbf{x} \in \mathbb{R}^d : r_1 \le \|\mathbf{x}\| \le r_2\}$ with $0 \le r_1 < r_2$. Figure 1 shows the class boundaries of the Bayes classifier for these two examples when $d = 2$ and $\pi_1 = \pi_2 = 1/2$. The regions colored grey (respectively, black) correspond to observations classified to the first (respectively, the second) class by the Bayes classifier.

Figure 1: Bayes class boundaries in $\mathbb{R}^2$. (a) Example E1 (b) Example E2

It is clear that classifiers like LDA and QDA, or any other classifier with linear or quadratic class boundaries, will deviate significantly from the Bayes classifier in both examples. A natural question then is how standard nonparametric classifiers like those based on $k$-NN and KDE perform in such examples. In Figure 2, we have plotted average misclassification rates of these two classifiers along with the Bayes risks for different values of $d$. These classifiers were trained on a sample of size 100 generated from each class distribution, and the misclassification rates were computed based on a sample of size 250 from each class. This procedure was repeated 500 times to calculate the average misclassification rate. Smoothing parameters associated with $k$-NN and KDE (i.e., the $k$ in $k$-NN and the bandwidth in KDE) were chosen by minimizing leave-one-out cross-validation estimates of misclassification rates. Figure 2 shows that in E1, the Bayes risk decreases to zero as $d$ grows. Since the class distributions in E2 have disjoint supports, the Bayes risk is zero irrespective of the value of $d$. But in both examples, the misclassification rates of these two nonparametric classifiers increased to almost 50% as $d$ increased.
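
To make the zero Bayes risk in E2 concrete, the following minimal sketch encodes the Bayes rule for E2 with equal priors. It is an illustration added here, not the authors' code, and the function name `bayes_rule_e2` is hypothetical. Because the two class supports are disjoint shells, the rule depends on $\mathbf{x}$ only through its Euclidean norm, whatever the dimension $d$.

```python
import math

def bayes_rule_e2(x):
    """Bayes rule for example E2 with equal priors (illustrative sketch).

    Class 1 is supported on shells with norm in [0, 1] and [2, 3];
    class 2 on shells with norm in [1, 2] and [3, 4]. Shell boundaries
    have probability zero, so ties there are broken arbitrarily.
    Returns None outside the union of the supports (norm > 4).
    """
    r = math.sqrt(sum(c * c for c in x))
    if r > 4.0:
        return None
    # Shells alternate between the two classes as the norm increases.
    return 1 if int(r) % 2 == 0 and r < 4.0 else 2

label_inner = bayes_rule_e2((0.5, 0.0))        # inside the unit ball
label_second = bayes_rule_e2((1.5, 0.0))       # norm between 1 and 2
label_third = bayes_rule_e2((0.0, 2.5, 0.0))   # works in any dimension
```

The decision regions are nested shells, so no single linear or quadratic boundary can reproduce them.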

These two examples clearly show the necessity to develop new classifiers to cope with such situations. Over the last three decades, data depth (see, e.g., [29]) has emerged as a powerful tool for data analysis with applications in many areas including supervised and unsupervised classification (see [20], ..., [18], ..., [23], [33]). Spatial depth (also known as the $L_1$ depth) is a popular notion of data depth that was introduced and studied in [38] and [37].

Figure 2: Misclassification rates of nonparametric classifiers and the Bayes classifier for d = 2, 5, 10, 20, 50 and 100. (a) Example E1 (b) Example E2

The spatial depth (SPD) of an observation $\mathbf{x} \in \mathbb{R}^d$ w.r.t. a distribution function $F$ on $\mathbb{R}^d$ is defined as $\mathrm{SPD}(\mathbf{x}, F) = 1 - \|E_F\{\mathbf{u}(\mathbf{x} - \mathbf{X})\}\|$, where $\mathbf{X} \sim F$ and $\mathbf{u}(\cdot)$ is the multivariate sign function given by $\mathbf{u}(\mathbf{x}) = \|\mathbf{x}\|^{-1}\mathbf{x}$ if $\mathbf{x} \ne \mathbf{0}_d \in \mathbb{R}^d$ and $\mathbf{u}(\mathbf{0}_d) = \mathbf{0}_d$. Henceforth, $\|\cdot\|$ will denote the Euclidean norm. Spatial depth is often computed on the standardized version of the data. In that case, SPD is defined as

$$\mathrm{SPD}(\mathbf{x}, F) = 1 - \|E_F\{\mathbf{u}(\Sigma^{-1/2}(\mathbf{x} - \mathbf{X}))\}\|,$$

where $\Sigma$ is a scatter matrix associated with $F$. If $\Sigma$ has the affine equivariance property, this version of SPD is affine invariant.
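
As a minimal sketch of the definition above, the expectation can be replaced by an average over a sample, which gives a sample version of the unstandardized SPD. The function names and the toy data are assumptions made for illustration, and standardization by a scatter matrix is deliberately left out.

```python
import math

def spatial_sign(v):
    """Multivariate sign u(v) = v / ||v||, with u(0) = 0."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [0.0] * len(v)
    return [c / norm for c in v]

def spatial_depth(x, sample):
    """Sample version of SPD: 1 - || (1/n) * sum_i u(x - x_i) ||."""
    n = len(sample)
    mean_sign = [0.0] * len(x)
    for xi in sample:
        s = spatial_sign([a - b for a, b in zip(x, xi)])
        mean_sign = [m + c / n for m, c in zip(mean_sign, s)]
    return 1.0 - math.sqrt(sum(c * c for c in mean_sign))

# Hypothetical toy sample: four points placed symmetrically around the origin.
square = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
depth_centre = spatial_depth((0.0, 0.0), square)   # unit vectors cancel at the centre
depth_far = spatial_depth((100.0, 0.0), square)    # unit vectors nearly align far away
```

The centre of the toy sample attains the maximal depth 1, while a point far from the data has depth close to 0, which is the centre-outward ordering that depth functions are meant to provide.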

Like other depth functions, SPD provides a centre-outward ordering of multivariate data. An observation has higher (respectively, lower) depth if it lies close to (respectively, away from) the centre of the distribution. In other words, given an observation $\mathbf{x}$ and a pair of probability distributions $F_1$ and $F_2$, if $\mathrm{SPD}(\mathbf{x}, F_1)$ is larger than $\mathrm{SPD}(\mathbf{x}, F_2)$, one would expect $\mathbf{x}$ to come from $F_1$ instead of $F_2$. Based on this simple idea, the maximum depth classifier was developed in [...]. For a $J$-class problem involving distributions $F_1, \ldots, F_J$, it classifies an observation $\mathbf{x}$ to the $j^*$-th class, where $j^* = \arg\max_{1 \le j \le J} \mathrm{SPD}(\mathbf{x}, F_j)$.

An important property of SPD (see Lemma 1 in Appendix) is that when the class distribution $F$ is unimodal and spherically symmetric, the class density function turns out to be a monotonically increasing function of SPD. In both examples E1 and E2, the class distributions are spherical. Consequently, $\mathrm{SPD}(\mathbf{x}, F)$ is a function of $\|\mathbf{x}\|$ in view of the rotational invariance of $\mathrm{SPD}(\mathbf{x}, F)$. In Figure 3, we have plotted $\mathrm{SPD}(\mathbf{x}, F_1)$ and $\mathrm{SPD}(\mathbf{x}, F_2)$ for different values of $\|\mathbf{x}\|$ for examples E1 and E2, where $F_1$ and $F_2$ are the two class distributions and $\mathbf{x} \in \mathbb{R}^2$. It is transparent from the plots that the maximum depth classifier based on SPD will fail in both examples. In example E1, for all values of $\|\mathbf{x}\|$ smaller (respectively, greater) than a constant close to 4, the observations will be classified to the first (respectively, the second) class by the maximum SPD classifier. On the other hand, this classifier will classify all observations to the second class in example E2.

Figure 3: $\mathrm{SPD}(\mathbf{x}, F_1)$ and $\mathrm{SPD}(\mathbf{x}, F_2)$ for different values of $\|\mathbf{x}\|$ when $\mathbf{x} \in \mathbb{R}^2$. (a) Example E1 (b) Example E2

In Section 2, we develop a modified classifier based on SPD to overcome this limitation of the maximum depth classifier. Most of the existing modified depth based classifiers are developed mainly for two class problems (see, e.g., [12], ..., [25], [33], [23]). For classification problems involving $J\,(> 2)$ classes, one usually solves $\binom{J}{2}$ binary classification problems, taking one pair of classes at a time, and then uses majority votes to make the final classification. Our proposed classification method based on SPD addresses the $J$ class problem directly.

Almost all depth based classifiers proposed in the literature require ellipticity of class distributions to achieve Bayes optimality. In order to cope with possible multimodal as well as non-elliptic population distributions, we construct a localized version of SPD (henceforth referred to as LSPD) in Section 3. In Section 4, we develop a multiscale classifier based on LSPD. Relevant theoretical results on SPD, LSPD and the resulting classifiers are also presented in these sections.

An advantage of SPD over other depth functions is its computational simplicity. Classifiers based on SPD and LSPD can be constructed even when the dimension of the data exceeds the sample size. We deal with such high-dimensional low sample size cases in Section 5, and show that both classifiers turn out to be optimal under a fairly general framework. In Sections 6 and 7, some simulated and benchmark data sets are analyzed to establish the usefulness of our classification methods. Section 8 contains a brief summary of the work and some concluding remarks. All proofs and mathematical details are given in the Appendix.


2 Bayes optimality of a classifier based on SPD 


Let us assume that $f_1, \ldots, f_J$ are the density functions of $J$ elliptically symmetric distributions on $\mathbb{R}^d$, where $f_j(\mathbf{x}) = |\Sigma_j|^{-1/2} g_j(\|\Sigma_j^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_j)\|)$ for $1 \le j \le J$. Here $\boldsymbol{\mu}_j \in \mathbb{R}^d$, $\Sigma_j$ is a $d \times d$ positive definite matrix, and $g_j(\|\mathbf{t}\|)$ is a probability density function on $\mathbb{R}^d$ for $1 \le j \le J$. For such classification problems involving general elliptic populations with equal or unequal priors, the next theorem establishes the Bayes optimality of a classifier, which is based on $\mathbf{z}(\mathbf{x}) = (z_1(\mathbf{x}), \ldots, z_J(\mathbf{x}))^T = (\mathrm{SPD}(\mathbf{x}, F_1), \ldots, \mathrm{SPD}(\mathbf{x}, F_J))^T$, the vector of SPD. In particular, it follows from this theorem that for examples E1 and E2 discussed at the beginning of Section 1, the class boundaries (see Figure 1) of the Bayes classifiers are functions of $\mathbf{z}(\mathbf{x}) = (\mathrm{SPD}(\mathbf{x}, F_1), \mathrm{SPD}(\mathbf{x}, F_2))^T$.


Theorem 1 If the densities of the $J$ competing classes are elliptically symmetric, the posterior probabilities of these classes satisfy the logistic regression model given by

$$p(j|\mathbf{x}) = p(j|\mathbf{z}(\mathbf{x})) = \frac{\exp(\Phi_j(\mathbf{z}(\mathbf{x})))}{1 + \sum_{k=1}^{J-1} \exp(\Phi_k(\mathbf{z}(\mathbf{x})))} \quad \text{for } 1 \le j \le (J-1), \tag{1}$$

$$\text{and} \quad p(J|\mathbf{x}) = p(J|\mathbf{z}(\mathbf{x})) = \frac{1}{1 + \sum_{k=1}^{J-1} \exp(\Phi_k(\mathbf{z}(\mathbf{x})))}. \tag{2}$$

Here $\Phi_j(\mathbf{z}(\mathbf{x})) = \varphi_{j1}(z_1(\mathbf{x})) + \ldots + \varphi_{jJ}(z_J(\mathbf{x}))$, and the $\varphi_{ji}$s are appropriate real-valued functions of real variables. Consequently, the Bayes rule assigns an observation $\mathbf{x}$ to the class $j^*$, where $j^* = \arg\max_{1 \le j \le J} p(j|\mathbf{z}(\mathbf{x}))$.
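
The mapping from additive predictors to posterior probabilities in equations (1) and (2) can be sketched as follows. This is only an illustration: the component functions used in the example are made-up linear functions, not estimates from any data, and the helper names are hypothetical.

```python
import math

def additive_predictor(z, components):
    """Phi_j(z) = sum_i phi_ji(z_i); components[i] plays the role of phi_ji."""
    return sum(f(zi) for f, zi in zip(components, z))

def posteriors(phi):
    """Map (Phi_1, ..., Phi_{J-1}) to (p(1|z), ..., p(J|z)) as in (1)-(2).

    Class J is the reference class with predictor 0. The shift by m only
    improves numerical stability and leaves the probabilities unchanged.
    """
    m = max(0.0, max(phi))
    expo = [math.exp(v - m) for v in phi]
    base = math.exp(-m)
    denom = base + sum(expo)
    return [e / denom for e in expo] + [base / denom]

# Hypothetical example with J = 3 classes and made-up component functions.
z = (0.2, 0.5, 0.1)
phi_1 = additive_predictor(z, [lambda t: 4 * t, lambda t: -4 * t, lambda t: 0.0])
phi_2 = additive_predictor(z, [lambda t: -4 * t, lambda t: 4 * t, lambda t: 0.0])
p = posteriors([phi_1, phi_2])
bayes_class = p.index(max(p)) + 1   # class labels start at 1
```

With these made-up functions the second class has the largest posterior, so the rule assigns the observation to class 2.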

Theorem 1 shows that the Bayes classifier is based on a nonparametric multinomial additive logistic regression model for the posterior probabilities, which is a special case of generalized additive models (GAM) [16]. If the prior probabilities of the $J$ classes are equal, and $f_1, \ldots, f_J$ are all elliptic and unimodal differing only in their locations, this Bayes classifier reduces to the maximum SPD classifier (see [12] and Remark 1 after the proof of Theorem 1 in the Appendix).

For any fixed $i$ and $j$, one can calculate the $J$-dimensional vector $\mathbf{z}(\mathbf{x}_{ji})$, where $\mathbf{x}_{ji}$ is the $i$-th training sample observation in the $j$-th class for $1 \le i \le n_j$ and $1 \le j \le J$. These $\mathbf{z}(\mathbf{x}_{ji})$s can be viewed as realizations of the vector of covariates in a nonparametric multinomial additive logistic regression model, where the response corresponds to the class label that belongs to $\{1, \ldots, J\}$. So, a classifier based on SPD can be constructed by fitting a generalized additive model with the logistic link function. In practice, when we compute SPD of $\mathbf{x}$ from the data $\mathbf{x}_1, \ldots, \mathbf{x}_n$ generated from $F$, we use its empirical version $\mathrm{SPD}(\mathbf{x}, F_n) = 1 - \|\frac{1}{n}\sum_{i=1}^{n} \mathbf{u}(\mathbf{x} - \mathbf{x}_i)\|$. For the standardized version of the data, it is defined as

$$\mathrm{SPD}(\mathbf{x}, F_n) = 1 - \Big\|\frac{1}{n}\sum_{i=1}^{n} \mathbf{u}(\hat{\Sigma}^{-1/2}(\mathbf{x} - \mathbf{x}_i))\Big\|,$$

where $\hat{\Sigma}$ is an estimate of $\Sigma$, and $F_n$ is the empirical distribution of the data $\mathbf{x}_1, \ldots, \mathbf{x}_n$. The resulting classifier worked well in examples E1 and E2, as we shall see in Section 6.
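
The construction of the covariate vectors $\mathbf{z}(\mathbf{x}_{ji})$ can be sketched as below, using the unstandardized empirical SPD. The toy training data and the function names are assumptions made for illustration, and the additive logistic fit itself is not shown.

```python
import math

def spatial_depth(x, sample):
    """Empirical SPD(x, F_n) = 1 - ||(1/n) sum_i u(x - x_i)||, with u(0) = 0."""
    n, d = len(sample), len(x)
    mean_sign = [0.0] * d
    for xi in sample:
        diff = [a - b for a, b in zip(x, xi)]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm > 0.0:  # a zero difference contributes the zero vector
            mean_sign = [m + c / (norm * n) for m, c in zip(mean_sign, diff)]
    return 1.0 - math.sqrt(sum(c * c for c in mean_sign))

def depth_features(x, class_samples):
    """z(x) = (SPD(x, F_1), ..., SPD(x, F_J)) computed from the class samples."""
    return [spatial_depth(x, sample) for sample in class_samples]

# Hypothetical toy training data: two classes in the plane.
class_samples = [
    [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)],     # class 1
    [(11.0, 0.0), (9.0, 0.0), (10.0, 1.0), (10.0, -1.0)],   # class 2
]

# One covariate vector and one class label per training observation.
features, labels = [], []
for j, sample in enumerate(class_samples, start=1):
    for x in sample:
        features.append(depth_features(x, class_samples))
        labels.append(j)

z_new = depth_features((0.1, 0.0), class_samples)  # features of a new point
```

The rows of `features`, paired with `labels`, are what an additive multinomial logistic regression routine would take as covariates and responses.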

3 Extraction of small scale distributional features by 
localization of spatial depth 

Under elliptic symmetry, the density function of a class can be expressed as a function of SPD, and hence the SPD contours coincide with the density contours. This is the main mathematical argument used in the proof of Theorem 1. Now, for certain non-elliptic distributions, where the density function cannot be expressed as a function of SPD, such mathematical arguments are no longer valid. For instance, consider an equal mixture of $N_d(\mathbf{0}_d, 0.25\mathbf{I}_d)$, $N_d(2\mathbf{1}_d, 0.25\mathbf{I}_d)$ and $N_d(4\mathbf{1}_d, 0.25\mathbf{I}_d)$, where $\mathbf{1}_d = (1, \ldots, 1)^T$. We have plotted its SPD contours in Figure 4 when $d = 2$. For this trimodal distribution, the SPD contours fail to match the density contours. As a second example, we consider a $d$-dimensional distribution with independent components, where the $i$-th component is exponential with the scale parameter $d/(d - i + 1)$ for $1 \le i \le d$. We have plotted its SPD contours in Figure 5 when $d = 2$. Even in this example, the SPD contours differ significantly from the density contours. To cope with this issue, we suggest a localization of SPD (see the third contour plots (c) in Figures 4 and 5). As we shall see later, this localized SPD relates to the underlying density function, and the resulting classifier turns out to be the Bayes classifier (in a limiting sense) in a general nonparametric setup with arbitrary class densities.

Note that $\mathrm{SPD}(\mathbf{x}, F) = 1 - \|E_F\{\mathbf{u}(\mathbf{x} - \mathbf{X})\}\|$ is constructed by assigning the same weight to each unit vector $\mathbf{u}(\mathbf{x} - \mathbf{X})$, ignoring the significance of the distance between $\mathbf{x}$ and $\mathbf{X}$. By introducing a weight function, which depends on this distance, one can extract important features related to the local geometry of the data. To capture these local features, we introduce a kernel function $K(\cdot)$ as a weight and define

$$\Gamma_h(\mathbf{x}, F) = E_F[K_h(\mathbf{t})] - \|E_F[K_h(\mathbf{t})\mathbf{u}(\mathbf{t})]\|,$$

where $\mathbf{t} = (\mathbf{x} - \mathbf{X})$ and $K_h(\mathbf{t}) = h^{-d}K(\mathbf{t}/h)$. Here $K$ is chosen to be a bounded continuous density function on $\mathbb{R}^d$ such that $K(\mathbf{t})$ is a decreasing function of $\|\mathbf{t}\|$ and $K(\mathbf{t}) \to 0$ as $\|\mathbf{t}\| \to \infty$. The Gaussian kernel $K(\mathbf{t}) = (\sqrt{2\pi})^{-d}\exp\{-\|\mathbf{t}\|^2/2\}$ is a possible choice. It is desirable that the localized version of SPD approximates the class density or a monotone function of it for small values of $h$. This will ensure that the class densities, and hence the class posterior probabilities, become functions of the local depth as $h \to 0$. On the other hand, one should expect that as $h \to \infty$, the localized version of SPD should tend to SPD or a monotone function of it. However, $\Gamma_h(\mathbf{x}, F) \to 0$ as $h \to \infty$. So, we re-scale $\Gamma_h(\mathbf{x}, F)$ by an appropriate factor to define the localized spatial depth (LSPD) function as follows:

$$\mathrm{LSPD}_h(\mathbf{x}, F) = \begin{cases} \Gamma_h(\mathbf{x}, F) & \text{if } h \le 1, \\ h^d\,\Gamma_h(\mathbf{x}, F) & \text{if } h > 1. \end{cases} \tag{3}$$

Using $\mathbf{t} = \Sigma^{-1/2}(\mathbf{x} - \mathbf{X})$ in the definition of $\Gamma_h(\mathbf{x}, F)$, one gets LSPD on standardized data, which is affine invariant if $\Sigma$ is affine equivariant. $\mathrm{LSPD}_h$ defined this way is a continuous function of $h$, and $\mathbf{z}_h(\mathbf{x}) = (\mathrm{LSPD}_h(\mathbf{x}, F_1), \ldots, \mathrm{LSPD}_h(\mathbf{x}, F_J))^T$ has the desired behavior as shown in Theorem 2.

Figure 4: Contours of density, SPD and $\mathrm{LSPD}_h$ (with h = .4) functions for a symmetric, trimodal density function. (a) Density (b) SPD (c) $\mathrm{LSPD}_{h=.4}$

Figure 5: Contours of density, SPD and $\mathrm{LSPD}_h$ (with h = .25) functions for the density function $f(x_1, x_2) = .5\exp\{-(x_1 + .5x_2)\}I\{x_1 > 0, x_2 > 0\}$. (a) Density (b) SPD (c) $\mathrm{LSPD}_{h=.25}$


Theorem 2 Consider a kernel function $K(\mathbf{t})$ that satisfies $\int \|\mathbf{t}\|K(\mathbf{t})\,d\mathbf{t} < \infty$. If $f_1, \ldots, f_J$ are continuous density functions with bounded first derivatives, and the scatter matrix $\Sigma_j$ corresponding to $f_j(\mathbf{x})$ exists for all $1 \le j \le J$, then

(a) $\mathbf{z}_h(\mathbf{x}) \to (|\Sigma_1|^{1/2}f_1(\mathbf{x}), \ldots, |\Sigma_J|^{1/2}f_J(\mathbf{x}))^T$ as $h \to 0$, and

(b) $\mathbf{z}_h(\mathbf{x}) \to (K(\mathbf{0})\mathrm{SPD}(\mathbf{x}, F_1), \ldots, K(\mathbf{0})\mathrm{SPD}(\mathbf{x}, F_J))^T$ as $h \to \infty$.

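Now, we construct a classifier by plugging in $\mathrm{LSPD}_h$ instead of SPD in the GAM discussed in Section 2. So, we consider the following model for the posterior probabilities

$$p(j|\mathbf{z}_h(\mathbf{x})) = \frac{\exp(\Phi_j(\mathbf{z}_h(\mathbf{x})))}{1 + \sum_{k=1}^{J-1} \exp(\Phi_k(\mathbf{z}_h(\mathbf{x})))}, \quad \text{for } 1 \le j \le (J-1), \tag{4}$$

$$\text{and} \quad p(J|\mathbf{z}_h(\mathbf{x})) = \frac{1}{1 + \sum_{k=1}^{J-1} \exp(\Phi_k(\mathbf{z}_h(\mathbf{x})))}. \tag{5}$$
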

The main implication of part (a) of Theorem 2 is that the classifier constructed using GAM and $\mathbf{z}_h(\mathbf{x})$ as the covariate tends to the Bayes classifier in a general nonparametric setup as $h \to 0$. On the other hand, part (b) of Theorem 2 implies that for elliptic class distributions, the same classifier tends to the Bayes classifier when $h \to \infty$. When we fit GAM, the functions $\Phi_j$s are estimated nonparametrically. Flexibility of such nonparametric estimates automatically takes care of the constants $|\Sigma_j|^{1/2}$ for $1 \le j \le J$ and $K(\mathbf{0})$ in the expressions of the limiting values of $\mathbf{z}_h(\mathbf{x})$ in parts (a) and (b) of Theorem 2, respectively.

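The empirical version of $\Gamma_h(\mathbf{x}, F)$, denoted by $\Gamma_h(\mathbf{x}, F_n)$, is defined as

$$\Gamma_h(\mathbf{x}, F_n) = \frac{1}{n}\sum_{i=1}^{n} K_h(\mathbf{t}_i) - \Big\|\frac{1}{n}\sum_{i=1}^{n} K_h(\mathbf{t}_i)\mathbf{u}(\mathbf{t}_i)\Big\|,$$

where $\mathbf{t}_i = (\mathbf{x} - \mathbf{x}_i)$ (or, $\hat{\Sigma}^{-1/2}(\mathbf{x} - \mathbf{x}_i)$ if we use the standardized version of the data) for $1 \le i \le n$. Then $\mathrm{LSPD}_h(\mathbf{x}, F_n)$ is defined using (3) with $\Gamma_h(\mathbf{x}, F)$ replaced by $\Gamma_h(\mathbf{x}, F_n)$.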
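
A minimal sketch of the empirical LSPD follows, assuming the Gaussian kernel and unstandardized data. The function names and the toy sample are hypothetical, and this is not the authors' R implementation. The last two lines evaluate the function at a large and a small value of $h$.

```python
import math

def gaussian_kernel(t):
    """K(t) = (2*pi)^(-d/2) * exp(-||t||^2 / 2)."""
    d = len(t)
    return (2.0 * math.pi) ** (-d / 2.0) * math.exp(-sum(c * c for c in t) / 2.0)

def lspd(x, sample, h):
    """Empirical LSPD_h(x, F_n) on unstandardized data (illustrative sketch)."""
    n, d = len(sample), len(x)
    mean_k = 0.0             # (1/n) * sum_i K_h(t_i)
    mean_ku = [0.0] * d      # (1/n) * sum_i K_h(t_i) * u(t_i)
    for xi in sample:
        t = [a - b for a, b in zip(x, xi)]
        norm = math.sqrt(sum(c * c for c in t))
        k_h = h ** (-d) * gaussian_kernel([c / h for c in t])
        mean_k += k_h / n
        if norm > 0.0:       # u(0) = 0 contributes nothing
            mean_ku = [m + k_h * c / (norm * n) for m, c in zip(mean_ku, t)]
    gamma = mean_k - math.sqrt(sum(c * c for c in mean_ku))
    # Re-scaling as in (3): multiply by h^d only when h > 1.
    return gamma if h <= 1.0 else h ** d * gamma

# Hypothetical toy sample: four points placed symmetrically around the origin.
square = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
lspd_large_h = lspd((0.0, 0.0), square, 1e4)   # close to K(0) * SPD for large h
lspd_small_h = lspd((0.5, 0.2), square, 0.5)   # behaves like a kernel-type estimate
```

At the centre of the toy sample the empirical SPD equals 1, so for a very large $h$ the value is close to $K(\mathbf{0}) = 1/(2\pi)$ in two dimensions.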

Theorem 3 below shows the almost sure uniform convergence of $\mathrm{LSPD}_h(\mathbf{x}, F_n)$ to its population counterpart $\mathrm{LSPD}_h(\mathbf{x}, F)$. A similar convergence result for the empirical version of SPD has been proved in the literature (see, e.g., [11]).


Theorem 3 Suppose that the density function $f$ and the kernel $K$ are bounded, and $K$ has bounded first derivatives. Then, for any fixed $h > 0$, $\sup_{\mathbf{x}} |\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| \to 0$ almost surely as $n \to \infty$.

From the proof of Theorem 3, it is easy to check that this almost sure uniform convergence also holds when $h \to \infty$. Under additional moment conditions on $f$ and $K$, this holds for the $h \to 0$ case as well if $nh^{2d}/\log n \to \infty$ as $n \to \infty$ (see Remarks 2 and 3 after the proof of Theorem 3 in the Appendix). So, the results stated in parts (a) and (b) of Theorem 2 continue to hold for the empirical version of LSPD under appropriate assumptions.

Localization and kernelization of different notions of data depth have been considered in the literature [1], ..., [30], [19], [32]. The fact that $\mathrm{LSPD}_h$ tends to a constant multiple of the probability density function as $h \to 0$ is a crucial requirement for limiting Bayes optimality of classifiers based on this localized depth function. In [1], the authors proposed localized versions of simplicial depth and half-space depth, but the relationship between the local depth and the probability density function has been established only in the univariate case. A depth function based on inter-point distances has been developed in [30] to capture multimodality in a data set. Chen et al. [1] defined kernelized spatial depth in a reproducing kernel Hilbert space. In [19], the authors considered a generalized notion of Mahalanobis depth in reproducing kernel Hilbert spaces. However, there is no result connecting them to the probability density function. In fact, the kernelized spatial depth function becomes degenerate at the value $(1 - 1/\sqrt{2})$ as the tuning parameter goes to zero. Consequently, it becomes non-informative for small values of the tuning parameter. It will be appropriate to note here that none of the preceding authors used their proposed depth functions for constructing classifiers. Recently, in [33], [32], the authors proposed a notion of local depth and used it for supervised classification. But their proposed version of local depth does not relate to the underlying density function either.

4 Multiscale classification based on LSPD 

When the class distributions are elliptic, part (b) of Theorem 2 implies that $\mathrm{LSPD}_h$ with appropriately large choices of $h$ leads to good classifiers. These large values may not be appropriate for non-elliptic class distributions, but part (a) of Theorem 2 implies that $\mathrm{LSPD}_h$ with appropriately small choices of $h$ leads to good classifiers for general nonparametric models for class densities. However, for small values of $h$, the empirical version of $\mathrm{LSPD}_h$ behaves like a nonparametric density estimate, and it suffers from the curse of dimensionality. So, the resulting classifier may have its statistical limitations for high-dimensional data.

We now consider two examples to demonstrate the above points. The first example (we call it E3) involves two multivariate normal distributions $N_d(\mathbf{0}_d, \mathbf{I}_d)$ and $N_d(\mathbf{1}_d, 4\mathbf{I}_d)$. In the second example (we call it E4), both distributions are trimodal. The first class has the same density as in Figure 4 (i.e., an equal mixture of $N_d(\mathbf{0}_d, 0.25\mathbf{I}_d)$, $N_d(2\mathbf{1}_d, 0.25\mathbf{I}_d)$ and $N_d(4\mathbf{1}_d, 0.25\mathbf{I}_d)$), while the second class is an equal mixture of $N_d(\mathbf{1}_d, 0.25\mathbf{I}_d)$, $N_d(3\mathbf{1}_d, 0.25\mathbf{I}_d)$ and $N_d(5\mathbf{1}_d, 0.25\mathbf{I}_d)$. We consider the case $d = 10$ for E3 and $d = 2$ for E4. For each of these two examples, we generated a training sample of size 100 from each class. The misclassification rate for the classifier based on $\mathrm{LSPD}_h$ was computed based on a test sample of size 500 (250 from each class). This procedure was repeated 100 times to calculate the average misclassification rate for different values of $h$. Figure 6 shows that the large (respectively, small) values of $h$ yielded low misclassification rates in E3 (respectively, E4). For small values of $h$, empirical $\mathrm{LSPD}_h$ behaved like a nonparametric density estimate that suffered from the curse of dimensionality in the ten-dimensional example E3. Consequently, its performance deteriorated. But for large $h$, the underlying elliptic structure was captured well by the proposed classifier. This provides a strong motivation for using a multi-scale approach in constructing the final classifier, so that one can harness the strength of different classifiers corresponding to different levels of localization of SPD. One would expect that, when aggregated, classifiers corresponding to different values of $h$ will lead to improved misclassification rates. Usefulness of the multi-scale approach in combining different classifiers has been discussed in the classification literature by several authors (see, e.g., [22]).

Figure 6: Misclassification rates in examples E3 and E4 for the classifier based on $\mathrm{LSPD}_h$ for different values of $h$.


A popular way of aggregation is to consider the weighted average of the estimated posterior probabilities computed for different values of $h$. There are various proposals for the choice of the weight function in the literature. Following [13], one can compute $\Delta_h$, the leave-one-out estimate of the misclassification rate of the classifier based on $\mathrm{LSPD}_h$, and use

$$W(h) \propto \exp\left[-\frac{1}{2}\,\frac{(\Delta_h - \Delta_0)^2}{\Delta_0(1 - \Delta_0)/n}\right]$$

as the weight function, where $\Delta_0 = \min_h \Delta_h$. The exponential function helps to appropriately weigh up (respectively, down) the promising (respectively, poor) classifiers resulting from different choices of the smoothing parameter $h$. However, $\int W(h)\,dh$ or $\int p(j|\mathbf{z}_h(\mathbf{x}))W(h)\,dh$ may not be finite for some choices of $j \in \{1, 2, \ldots, J\}$. So, here we use a slightly modified weight function $W^*(h) = W(h)g(h)$, where $g$ is a univariate Cauchy density with a large scale parameter and support restricted to have positive values only. Our final classifier, which we call the LSPD classifier, assigns an observation $\mathbf{x}$ to the $j^*$-th class, where

$$j^* = \arg\max_{1 \le j \le J} \int_{h>0} W^*(h)\,p(j|\mathbf{z}_h(\mathbf{x}))\,dh = \arg\max_{1 \le j \le J} \int_{h>0} W(h)g(h)\,p(j|\mathbf{z}_h(\mathbf{x}))\,dh.$$

Here $p(j|\mathbf{z}_h(\mathbf{x}))$ is as in equations (4) and (5) in Section 3. In practice, we first generate $M$ independent observations $h_1, h_2, \ldots, h_M$ from $g$. For any given $j$ and $\mathbf{x}$, we approximate $\int_{h>0} W(h)g(h)\,p(j|\mathbf{z}_h(\mathbf{x}))\,dh$ by $\sum_{i=1}^{M} W(h_i)\,p(j|\mathbf{z}_{h_i}(\mathbf{x}))/M$. The use of the Cauchy distribution with a large scale parameter (we use 100 in this article) helps us to generate small as well as large values of $h$. This is desirable in view of Theorem 2.
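
The aggregation step can be sketched as follows. This is an illustration rather than the authors' code: the error rates and posterior estimates in the example are made-up numbers, the helper names are hypothetical, and the handling of the degenerate case $\Delta_0 \in \{0, 1\}$ is an assumption introduced here to avoid a division by zero.

```python
import math
import random

def half_cauchy_draws(m, scale, rng):
    """Draw m values of h from a Cauchy(0, scale) density restricted to h > 0."""
    # Inverse-CDF sampling; 1 - rng.random() lies in (0, 1], so every h is positive.
    return [scale * math.tan(math.pi * (1.0 - rng.random()) / 2.0) for _ in range(m)]

def weights(delta, n):
    """W(h_i) = exp{-0.5 * (Delta_i - Delta_0)^2 / (Delta_0 * (1 - Delta_0) / n)}."""
    d0 = min(delta)
    var = d0 * (1.0 - d0) / n
    if var == 0.0:
        # Assumption: when the variance term vanishes, keep only the best scales.
        return [1.0 if dh == d0 else 0.0 for dh in delta]
    return [math.exp(-0.5 * (dh - d0) ** 2 / var) for dh in delta]

def aggregate(posteriors_by_h, delta, n):
    """Monte Carlo version of the weighted integral: mean of W(h_i) * p(j | z_{h_i}(x))."""
    w = weights(delta, n)
    n_classes = len(posteriors_by_h[0])
    scores = [sum(wi * p[j] for wi, p in zip(w, posteriors_by_h)) / len(w)
              for j in range(n_classes)]
    return scores.index(max(scores)) + 1, scores

rng = random.Random(0)
h_draws = half_cauchy_draws(5, 100.0, rng)         # scale 100, as in the text

# Made-up leave-one-out error rates and posterior estimates at three scales.
delta = [0.10, 0.30, 0.12]
posteriors_by_h = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]
label, scores = aggregate(posteriors_by_h, delta, n=200)
```

In this toy example the poorly performing scale (error rate 0.30) receives a weight that is practically zero, so the decision is driven by the two good scales and the observation goes to class 1.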
5 Classification of high-dimensional data


A serious practical limitation of many existing depth based classifiers is their computational complexity in high dimensions, and this makes such classifiers impossible to use even for moderately large dimensional data. Besides, depth functions that are based on random simplices formed by the data points (see [29], [12]) cannot be defined in a meaningful way if the dimension of the data exceeds the sample size. Projection depth and Tukey's half-space depth (see, e.g., [12]) both become degenerate at zero for such high-dimensional data. Classification of high-dimensional data presents a substantial challenge to many nonparametric classification tools as well. We have seen in examples E1 and E2 (see Figure 2) that nonparametric classifiers like those based on $k$-NN and KDE can yield poor performance when the data dimension is large. Some limitations of support vector machines for classification of high-dimensional data have also been noted in [...].

One of our primary motivations behind using SPD is its computational tractability, especially when the dimension is large. We now investigate the behavior of classifiers based on SPD and LSPD for such high-dimensional data. For this investigation, we assume that the observations are all standardized by a common positive definite matrix $\Sigma$ for all $J$ classes, and the following conditions are stated for those standardized random vectors, which are written as $\mathbf{X}$s for notational convenience.

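(C1) Consider a random vector $\mathbf{X}_1 = (X_1^{(1)}, \ldots, X_1^{(d)})^T \sim F_j$. Assume that $a_j = \lim_{d \to \infty} d^{-1}\sum_{k=1}^{d} E\{(X_1^{(k)})^2\}$ exists for $1 \le j \le J$, and $d^{-1}\sum_{k=1}^{d} (X_1^{(k)})^2 \to a_j$ almost surely as $d \to \infty$.

(C2) Consider two independent random vectors $\mathbf{X}_1 = (X_1^{(1)}, \ldots, X_1^{(d)})^T \sim F_j$ and $\mathbf{X}_2 = (X_2^{(1)}, \ldots, X_2^{(d)})^T \sim F_i$. Assume that $b_{ji} = \lim_{d \to \infty} d^{-1}\sum_{k=1}^{d} E(X_1^{(k)}X_2^{(k)})$ exists, and $d^{-1}\sum_{k=1}^{d} X_1^{(k)}X_2^{(k)} \to b_{ji}$ almost surely as $d \to \infty$ for all $1 \le j, i \le J$.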

It is not difficult to verify that for $\mathbf{X}_1 \sim F_j$ ($1 \le j \le J$), if we assume that the sequence of variables $\{X_1^{(k)} - E(X_1^{(k)}) : k = 1, 2, \ldots\}$ centered at their means are independent with uniformly bounded eighth moments (see Theorem 1 (2) in [21], p. 4110), or if we assume that they are $m$-dependent random variables with some appropriate conditions (see Theorem 2 in [5], p. 350), then the almost sure convergence in (C1) as well as (C2) holds. As a matter of fact, the almost sure convergence stated in (C1) and (C2) holds if we assume that for all $1 \le j, i \le J$, the sequences $\{(X_1^{(k)})^2 - E\{(X_1^{(k)})^2\} : k = 1, 2, \ldots\}$ and $\{X_1^{(k)}X_2^{(k)} - E(X_1^{(k)}X_2^{(k)}) : k = 1, 2, \ldots\}$, where $\mathbf{X}_1 \sim F_j$ and $\mathbf{X}_2 \sim F_i$, are mixingales satisfying some appropriate conditions (see, e.g., Theorem 2 in [5], p. 350). Define $\sigma_j^2 = a_j - b_{jj}$ and $\nu_{ji} = b_{jj} - 2b_{ji} + b_{ii}$. For the random vector $\mathbf{X}_1 \sim F_j$, $\sigma_j^2$ is the limit of $d^{-1}\sum_{k=1}^{d} V(X_1^{(k)})$ as $d \to \infty$, where $V(Z)$ denotes the variance of a random variable $Z$. If we consider a second independent random vector $\mathbf{X}_2 \sim F_i$ with $i \ne j$, then $\nu_{ji}$ is the limit of $d^{-1}\sum_{k=1}^{d} \{E(X_1^{(k)}) - E(X_2^{(k)})\}^2$ as $d \to \infty$.
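
The role of these constants can be seen from a short worked step, added here as a sketch. It assumes the reading of (C1) and (C2) in which the averages of squares converge to $a_j$ and the averages of cross products converge to $b_{ji}$, and it uses only the definitions $\sigma_j^2 = a_j - b_{jj}$ and $\nu_{ji} = b_{jj} - 2b_{ji} + b_{ii}$ from the text. For independent $\mathbf{X}_1 \sim F_j$ and $\mathbf{X}_2 \sim F_i$, expanding the squared Euclidean distance gives:

```latex
\begin{aligned}
d^{-1}\|\mathbf{X}_1 - \mathbf{X}_2\|^2
  &= d^{-1}\sum_{k=1}^{d} (X_1^{(k)})^2
   + d^{-1}\sum_{k=1}^{d} (X_2^{(k)})^2
   - 2\,d^{-1}\sum_{k=1}^{d} X_1^{(k)} X_2^{(k)} \\
  &\to a_j + a_i - 2 b_{ji} \\
  &= (a_j - b_{jj}) + (a_i - b_{ii}) + (b_{jj} - 2 b_{ji} + b_{ii}) \\
  &= \sigma_j^2 + \sigma_i^2 + \nu_{ji} \qquad \text{as } d \to \infty .
\end{aligned}
```

So, under these assumptions, the scaled distance between observations from classes $j$ and $i$ concentrates around $\sqrt{\sigma_j^2 + \sigma_i^2 + \nu_{ji}}$, and for two independent observations from the same class it concentrates around $\sqrt{2}\,\sigma_j$.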

In [15], the authors assumed a similar set of conditions to study the performance of the classifier based on support vector machines (SVM) with a linear kernel and the $k$-NN classifier with $k = 1$ as the data dimension grows to infinity. Similar conditions on observation vectors were also considered in [...] to study the consistency of principal components of the sample dispersion matrix for high-dimensional data. Under (C1) and (C2), the following theorem describes the behavior of $\mathbf{z}(\mathbf{x})$ and $\mathbf{z}_h(\mathbf{x})$ as $d$ grows to infinity.


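Theorem 4 Suppose that the conditions (C1)-(C2) hold, and $\mathbf{X} \sim F_j$ ($1 \le j \le J$).

(a) $\mathbf{z}(\mathbf{X}) \to (c_{j1}, \ldots, c_{jJ})^T = \mathbf{c}_j$ as $d \to \infty$, where $c_{jj} = 1 - \sqrt{1/2}$ and $c_{ji} = 1 - \sqrt{\frac{\sigma_j^2 + \nu_{ji}}{\sigma_j^2 + \sigma_i^2 + \nu_{ji}}}$ for $1 \le j \ne i \le J$.

(b) Assume that $h \to \infty$ and $d \to \infty$ in such a way that $\sqrt{d}/h \to 0$ or $A\,(> 0)$. Then, $\mathbf{z}_h(\mathbf{X}) \to g(0)\mathbf{c}_j$ or $\mathbf{c}'_j = (g(e_{j1}A)c_{j1}, \ldots, g(e_{jJ}A)c_{jJ})^T$ depending on whether $\sqrt{d}/h \to 0$ or $A$, respectively. Here $K(\mathbf{t}) = g(\|\mathbf{t}\|)$, $e_{jj} = \sqrt{2}\,\sigma_j$ and $e_{ji} = \sqrt{\sigma_j^2 + \sigma_i^2 + \nu_{ji}}$ for $j \ne i$.

(c) Assume that $h > 1$, and $\sqrt{d}/h \to \infty$ as $d \to \infty$. Then, $\mathbf{z}_h(\mathbf{X}) \to \mathbf{0}_J$.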

The $\mathbf{c}_j$s as well as the $\mathbf{c}'_j$s in the statement of Theorem 4 are distinct for all $1 \le j \le J$ whenever either $\sigma_j^2 \ne \sigma_i^2$ or $\nu_{ji} \ne 0$ for all $1 \le j \ne i \le J$ (see Lemma 2 in Appendix). In such a case, part (a) of Theorem 4 implies that for large $d$, $\mathbf{z}(\mathbf{x})$ has good discriminatory power, and our classifier based on SPD can discriminate well among the $J$ populations. Further, it follows from part (b) that when both $d$ and $h$ grow to infinity in such a way that $\sqrt{d}/h \to 0$ or a positive constant, $\mathbf{z}_h(\mathbf{x})$ has good discriminatory power as well, and our classifier based on $\mathrm{LSPD}_h$ can yield low misclassification probability. However, part (c) shows that if $\sqrt{d}$ grows at a rate faster than $h$, $\mathbf{z}_h(\mathbf{x})$ becomes non-informative. Consequently, the classifier based on $\mathrm{LSPD}_h$ leads to high misclassification probability in this case.


6 Analysis of simulated data sets 

We analysed some data sets simulated from elliptic as well as non-elliptic distributions. In each example, taking an equal number of observations from each of the two classes, we generated 500 training sets of size 200 and 500 test sets of size 500. We considered examples in dimensions 5 and 100. For classifiers based on SPD and LSPD, we used the usual sample dispersion matrix of the $j$-th ($j = 1, 2$) class as $\hat{\Sigma}_j$ when $d = 5$. For $d = 100$, due to statistical instability of the sample dispersion matrix, we standardized each variable in a class by its sample standard deviation. Average test set misclassification rates of different classifiers (over 500 test sets) are reported in Table 1 along with their corresponding standard errors. To facilitate comparison, the corresponding Bayes risks are reported as well.

We compared our proposed classifiers with a pool of classifiers that includes parametric classifiers like LDA and QDA, and nonparametric classifiers like those based on $k$-NN (with the Euclidean metric as the distance function) and KDE (with the Gaussian kernel). For the implementation of LDA and QDA in dimension 100, we used diagonal estimates of dispersion matrices as in the cases of SPD and LSPD. For $k$-NN and KDE, we used pooled versions of the scatter matrix estimates, which were chosen to be diagonal for $d = 100$. In Table 1, we report results for the multiscale methods of $k$-NN [13] and KDE [13] using the same weight function as described in Section 4. To facilitate comparison, we also considered SVM having the linear kernel and the radial basis function (RBF) kernel (i.e., $K_\gamma(\mathbf{x}, \mathbf{y}) = \exp\{-\gamma\|\mathbf{x} - \mathbf{y}\|^2\}$ with the default value $\gamma = 1/d$ as in http://www.csie.ntu.edu.tw/~cjlin/libsvm/), the classifier based on classification and regression trees (CART), and an ensemble version of CART known as random forest (RF). For the implementation of SVM, CART and RF, we used the R codes available in the libraries e1071 [6], tree [35] and randomForest [...], respectively. For classifiers based on SPD and LSPD, we wrote our own R codes using the library VGAM [...], and the codes are available at https://sites.google.com/site/tijahbus/home/lspd.
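
For reference, the RBF kernel with the default $\gamma = 1/d$ mentioned above can be written out directly. This is a small sketch with a hypothetical function name, independent of any SVM library.

```python
import math

def rbf_kernel(x, y, gamma=None):
    """K_gamma(x, y) = exp(-gamma * ||x - y||^2); gamma defaults to 1/d."""
    if gamma is None:
        gamma = 1.0 / len(x)   # default used in the text: gamma = 1/d
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel((1.0, 2.0), (1.0, 2.0))   # identical points give 1
k_diff = rbf_kernel((0.0, 0.0), (1.0, 1.0))   # exp(-(1/2) * 2) = exp(-1)
```

Scaling $\gamma$ by $1/d$ keeps the exponent of the same order as $d$ grows when squared distances grow roughly linearly in $d$.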

In addition, we compared the performance of our classifiers with two depth based classification methods: the classifier based on depth-depth plots (DD) [25] and the classifier based on maximum local depth [33] (LD). The DD classifier uses a polynomial of class depths (usually, half-space depth or projection depth is used, and depth is computed based on several random projections) to construct the separating surface. We used polynomials of different degrees and reported the best result in Table 1. For the LD classifier, we used the R package DepthProc and considered the best result obtained over a range of values for the localization parameter. However, in almost all cases, the performance of the LD classifier was inferior to that of the DD classifier. So, we did not report its misclassification rates in Table 1.

6.1 Examples with elliptic distributions 

Recall examples E1 and E2 in Section 2 and example E3 in Section 4 involving elliptic
class distributions. In E1 with d = 5, the DD classifier led to the lowest misclassification
rate closely followed by SPD and LSPD classifiers, but in the case of d = 100, SPD and
LSPD classifiers significantly outperformed all other classifiers considered here (see Table
1). The superiority of these two classifiers was evident in E2 as well. In the case of d = 5,
the difference between their misclassification rates was statistically insignificant, though
the SPD classifier had an edge; when d = 100, this difference was found to be statistically
significant. Since the class distributions were elliptic, dominance of the SPD classifier over
the LSPD classifier was quite expected. In view of the normality of the class distributions,
QDA was expected to have the best performance in E3, and we observed the same. For
d = 5, the DD classifier ranked second here, while the performance of SPD and LSPD
classifiers was satisfactory. However, in the case of d = 100, SPD and LSPD classifiers again
outperformed the DD classifier, and they correctly classified all the test set observations.


Table 1: Misclassification rates (in %) of different classifiers in simulated data sets (standard errors in parentheses).

Data set  Bayes risk  LDA        QDA        SVM (LIN)  SVM (RBF)  k-NN       KDE        CART       RF         DD         SPD        LSPD

d = 5
E1        26.50       50.00      52.53      45.46      30.03      40.65      39.16      36.90      31.32      27.92*     28.32      28.54
                      (0.20)     (0.19)     (0.11)     (0.09)     (0.13)     (0.11)     (0.13)     (0.09)     (0.11)     (0.10)     (0.11)
E2        0.00        47.43      42.44      43.92      38.06      37.64      34.29      39.10      34.26      26.68      9.26*      9.42
                      (0.15)     (0.06)     (0.12)     (0.09)     (0.16)     (0.11)     (0.11)     (0.08)     (0.09)     (0.09)     (0.10)
E3        10.14       21.56      11.09*     22.09      11.74      18.16      16.94      19.18      13.77      11.17      11.49      11.64
                      (0.09)     (0.07)     (0.09)     (0.07)     (0.09)     (0.08)     (0.13)     (0.08)     (0.07)     (0.07)     (0.07)
E4        2.10        40.52      42.41      36.16      25.08      2.42*      2.55       15.52      4.98       33.04      10.07      2.58
                      (0.09)     (0.08)     (0.10)     (0.13)     (0.03)     (0.03)     (0.09)     (0.06)     (0.12)     (0.07)     (0.03)
E5        2.04        41.17      5.97       32.14      8.12       9.44       9.26       4.82       2.84*      5.82       5.65       5.52
                      (0.15)     (0.05)     (0.34)     (0.07)     (0.08)     (0.07)     (0.08)     (0.03)     (0.05)     (0.06)     (0.06)

d = 100
E1        0.48        50.29      50.67      46.85      24.97      44.57      49.99      35.72      25.14      24.99      1.60*      2.34
                      (0.10)     (0.13)     (0.11)     (0.06)     (0.08)     (0.10)     (0.12)     (0.12)     (0.10)     (0.11)     (0.12)
E2        0.00        43.77      46.13      43.99      40.32      49.96      49.22      40.30      32.36      27.56      2.90*      3.18
                      (0.09)     (0.04)     (0.09)     (0.06)     (0.02)     (0.06)     (0.11)     (0.10)     (0.09)     (0.08)     (0.09)
E3        0.00        0.46       0.00*      3.21       0.00*      49.99      49.98      17.40      0.02       1.92       0.00*      0.00*
                      (0.01)     (0.00)     (0.05)     (0.00)     (0.00)     (0.00)     (0.12)     (0.00)     (0.02)     (0.00)     (0.00)
E4        0.00        33.40      33.40      46.28      19.43      0.00*      0.00*      17.28      0.00*      23.15      0.00*      0.00*
                      (0.00)     (0.00)     (0.10)     (0.09)     (0.00)     (0.00)     (0.00)     (0.09)     (0.10)     (0.00)     (0.00)
E5        0.00        46.74      0.00*      44.45      7.83       44.01      49.98      3.32       0.00*      3.12       0.00*      0.00*
                      (0.29)     (0.00)     (0.31)     (0.15)     (0.21)     (0.04)     (0.11)     (0.00)     (0.10)     (0.00)     (0.00)

The figure marked by '*' is the best misclassification rate observed in an example. The other figures in bold
(if any) are the misclassification rates whose differences with the best misclassification rate are statistically
insignificant at the 5% level when the usual large sample test for proportion was used for comparison.


In all these examples, the Bayes classifier had non-linear class boundaries. So, LDA
and SVM with linear kernel could not perform well. The performance of SVM with the
RBF kernel was relatively better. In E3, it had competitive misclassification rates for both
values of d. k-NN and KDE had comparable performance in the case of d = 5, but in the
high-dimensional case (d = 100), they misclassified almost half of the test cases. In [15],
the authors derived some conditions under which the k-NN classifier tends to classify all
observations to a single class when the data dimension increases to infinity. These conditions
hold in this example. It can also be shown that the classifier based on KDE with equal prior
probabilities has the same problem in high dimensions.

6.2 Examples with non-elliptic distributions 

Recall the trimodal example E4 discussed in Section 4. In this example, the LSPD classifier
and the nonparametric classifiers based on k-NN and KDE significantly outperformed all
other classifiers in the case of d = 5. The differences between the misclassification rates of
these three classifiers were statistically insignificant. Interestingly, along with these classifiers,
the SPD classifier also led to zero misclassification rate for d = 100. The DD classifier, LDA,
QDA and SVM did not have satisfactory performance in this example.

The final example (we call it E5) is with exponential distributions, where the component
variables are independently distributed in both classes. The i-th variable in the first
(respectively, the second) class is exponential with scale parameter d/(d − i + 1)
(respectively, d/2i). Further, the second class has a location shift such that the difference between
the mean vectors of the two classes is $\frac{1}{d}\mathbf{1}_d = (1/d, \ldots, 1/d)^T$. Recall that Figure 5 shows
the density contours of the first class when d = 2. In this example, the RF classifier had
the best performance followed by CART when d = 5. DD, SPD and LSPD classifiers also
performed well, and their misclassification rates were significantly lower than those of all other
classifiers. Linear classifiers like LDA and SVM with linear kernel failed to perform well. Note
that as d increases, the difference between the locations of these two classes shrinks to zero.
This results in high misclassification rates for these linear classifiers when d = 100. QDA
performed reasonably well, and like SPD, LSPD and RF classifiers, it correctly classified all
the test cases when d = 100. The DD classifier led to an average misclassification rate of
3.12%. Again, the classifiers based on k-NN and KDE had poor performance for d = 100.
This is due to the same reason as in E3 (see also [15]). Note that even in these examples
with non-elliptic distributions, the SPD classifier performed well for high-dimensional data.
This can be explained using part (a) of Theorem 4. These examples also demonstrate that
for non-elliptic or multimodal data, our LSPD classifier can perform as well as, if not better
than, popular nonparametric classifiers. In fact, this adjustment of the LSPD classifier is automatic
in view of the multiscale approach developed in Section 4.
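
The construction of example E5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulation code used for Table 1: we read "scale parameter" as the mean of the exponential distribution, and we assume the shift is added to the second class so that (mean of class 2) minus (mean of class 1) equals 1/d in every coordinate (the text does not fix the sign). The function names `e5_scales`, `e5_shift` and `sample_e5` are hypothetical.

```python
import random

def e5_scales(d):
    """Scale (mean) parameters of the i-th variable, i = 1..d, in the two classes."""
    scales_1 = [d / (d - i + 1) for i in range(1, d + 1)]
    scales_2 = [d / (2 * i) for i in range(1, d + 1)]
    return scales_1, scales_2

def e5_shift(d):
    """Location shift for class 2 (assumed sign): mean_2 - mean_1 = 1/d per coordinate."""
    scales_1, scales_2 = e5_scales(d)
    return [s1 + 1.0 / d - s2 for s1, s2 in zip(scales_1, scales_2)]

def sample_e5(d, n, label, rng):
    """Draw n observations from class 1 or class 2 of E5 with independent coordinates."""
    scales_1, scales_2 = e5_scales(d)
    if label == 1:
        # random.expovariate takes the rate, i.e. 1 / scale.
        return [[rng.expovariate(1.0 / s) for s in scales_1] for _ in range(n)]
    shift = e5_shift(d)
    return [[rng.expovariate(1.0 / s) + c for s, c in zip(scales_2, shift)]
            for _ in range(n)]

rng = random.Random(0)
class_1 = sample_e5(5, 10, 1, rng)
class_2 = sample_e5(5, 10, 2, rng)
```

Because the mean difference is 1/d in each coordinate, the Euclidean distance between the two mean vectors is $1/\sqrt{d}$, which shrinks to zero as d grows, in line with the remark above about linear classifiers.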

7 Analysis of benchmark data sets 

We analyzed some benchmark data sets for further evaluation of our proposed classifiers. The
biomedical data set is taken from the CMU data archive (http://lib.stat.cmu.edu/datasets/),
the growth data set is obtained from [M], the colon data set is available in [2] (and also in
the R package ‘rda’), and the lightning 2 data set is taken from the UCR time series
classification archive (http://www.cs.ucr.edu/~eamonn/time_series_data/). The remaining data
sets are taken from the UCI machine learning repository (http://archive.ics.uci.edu/ml/).
Descriptions of these data sets are available at these sources. In the case of biomedical data,
we did not consider observations with missing values. The satellite image (satimage) data set has
specific training and test samples. For this data set, we report the test set misclassification
rates of different classifiers. If a classifier had misclassification rate e, its standard error
was computed as $\sqrt{e(1-e)/(\text{the size of the test set})}$. In all other data sets, we formed the
training and the test sets by randomly partitioning the data, and this random partitioning
was repeated 500 times to generate new training and test sets. The average test set
misclassification rates of different classifiers are reported in Table 2 along with their corresponding
standard errors. The sizes of the training and the test sets in each partition are also
reported in this table. Since the codes for the DD classifier are available only for two-class
problems, we could use it only in the cases of biomedical and Parkinson’s data, where it yielded
misclassification rates of 12.54% and 14.48%, respectively, with corresponding standard errors
of 0.18% and 0.15%. In the case of growth data, where the training sample size from each class
was smaller than the dimension, the values of randomized versions of half-space depth and
projection depth were zero for almost all observations. Due to this problem, the DD classifier
could not be used. We used the maximum LD classifier on these real data sets, but in most
of the cases, its performance was not satisfactory. So, we do not report its misclassification
rates in Table 2.
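
The standard error used for the satimage data (a single test set of size 2000) is the usual binomial one, and it can be checked with a few lines of Python. The function name `misclassification_se` is our own; the rates plugged in below are the LDA, k-NN and SPD entries of the satimage column of Table 2.

```python
import math

def misclassification_se(rate_percent, test_size):
    """Standard error (in %) of a misclassification rate from one test set.

    Implements sqrt(e * (1 - e) / test_size), with e given in percent.
    """
    e = rate_percent / 100.0
    return 100.0 * math.sqrt(e * (1.0 - e) / test_size)

# Satimage test set size is 2000.
se_lda = misclassification_se(16.06, 2000)
se_knn = misclassification_se(9.23, 2000)
se_spd = misclassification_se(12.59, 2000)
```

Rounded to two decimals, these give 0.82, 0.65 and 0.74, which agree with the standard errors reported for those three classifiers.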


Table 2: Misclassification rates (in %) of different classifiers in real data sets (standard errors in parentheses).

Data set        Biomed       Parkinson’s  Wine         Waveform     Vehicle      Satimage     Growth       Lightning 2  Colon
Dimension (d)   4            22           13           21           18           36           31           637          2000
Classes (J)     2            2            3            3            4            6            2            2            2
Training size   100          97           100          300          423          4435         46           60           31
Test size       94           98           78           501          423          2000         47           61           31

LDA             15.66        30.93        2.00         19.90        22.49        16.06        29.15        32.51        14.03*
                (0.14)       (0.12)       (0.06)       (0.15)       (0.07)       (0.82)       (0.34)       (0.35)       (0.20)
QDA             12.57        xxxx         2.46         21.12        16.38        14.14        xxxx         xxxx         xxxx
                (0.13)       xxxx         (0.09)       (0.15)       (0.07)       (0.78)       xxxx         xxxx         xxxx
SVM (LIN)       22.03        15.31        3.64         18.88        21.20        15.18        5.16         35.64        16.38
                (0.13)       (0.12)       (0.09)       (0.07)       (0.07)       (0.80)       (0.12)       (0.35)       (0.23)
SVM (RBF)       12.76        13.69        1.86         16.08        25.57        30.99        4.66*        28.73        35.48
                (0.13)       (0.10)       (0.06)       (0.07)       (0.08)       (1.03)       (0.11)       (0.32)       (0.00)
k-NN            17.74        14.42        1.98         16.37        21.80        9.23*        4.48         30.09        22.47
                (0.15)       (0.16)       (0.06)       (0.08)       (0.08)       (0.65)       (0.10)       (0.20)       (0.27)
KDE             16.67        11.01*       1.36*        23.83        21.21        19.81        4.79         28.11        23.20
                (0.14)       (0.12)       (0.05)       (0.03)       (0.07)       (0.89)       (0.13)       (0.30)       (0.28)
CART            17.69        16.63        10.99        56.61        31.41        53.43        17.40        33.96        28.78
                (0.18)       (0.20)       (0.22)       (0.12)       (0.10)       (0.56)       (0.25)       (0.34)       (0.35)
RF              13.23        11.58        2.12         57.02        25.52        30.91        9.67         22.08*       19.10
                (0.14)       (0.15)       (0.06)       (0.12)       (0.07)       (0.48)       (0.25)       (0.34)       (0.30)
SPD             12.53        15.44        2.34         15.12*       16.35*       12.59        14.64        27.70        19.98
                (0.21)       (0.15)       (0.08)       (0.06)       (0.08)       (0.74)       (0.28)       (0.30)       (0.31)
LSPD            12.49*       11.35        1.85         15.36        17.15        12.59        5.10         27.46        20.51
                (0.15)       (0.11)       (0.07)       (0.06)       (0.08)       (0.74)       (0.12)       (0.30)       (0.33)

The figure marked by '*' is the best misclassification rate observed for a data set. The other figures in bold
(if any) are the misclassification rates whose differences with the best misclassification rate are statistically
insignificant at the 5% level. Because of the singularity of the estimated class dispersion matrices, QDA
could not be used in some cases and those are marked by ‘xxxx’.

In the biomedical and vehicle data sets, scatter matrices of the competing classes were very
different. So, QDA had significant improvement over LDA. In fact, the misclassification rates
of QDA were close to the best ones. In both of these data sets, the class distributions were
nearly elliptic (this can be verified using the diagnostic plots suggested in [26]). The SPD
classifier utilized the ellipticity of the class distributions to outperform the nonparametric
classifiers. The LSPD classifier could compete with the SPD classifier in the biomedical data.
But in the vehicle data, where the evidence of ellipticity was much stronger, it had a slightly
higher misclassification rate.

In the Parkinson’s data set, we could not use QDA because of the singularity of the
estimated class dispersion matrices. So, we used the estimated pooled dispersion matrix for
standardization in our classifiers. In this data set, all nonparametric classifiers had
significantly lower misclassification rates than LDA. Among them, the classifier based on KDE had
the lowest misclassification rate. The performance of the LSPD classifier was also competitive.
Since the underlying distributions were non-elliptic, the LSPD classifier significantly
outperformed the SPD classifier. We observed almost the same phenomenon in the wine data set
as well, where the classifier based on KDE yielded the best misclassification rate followed by
the LSPD classifier. In these two data sets, although the data dimension was quite high, all
competing classes had low intrinsic dimensions (these can be estimated using [23]). So, the
nonparametric methods like KDE were not much affected by the curse of dimensionality. Recall
that for small values of h, $\mathrm{LSPD}_h$ performs like KDE. Therefore, the difference between the
misclassification rates of KDE and LSPD classifiers was statistically insignificant.

In the waveform data set, the SPD classifier had the best misclassification rate. In this
data set, the class distributions were nearly elliptic. So, the SPD classifier was expected to
perform well. As the LSPD classifier is quite flexible, it yielded competitive misclassification
rates. Here, the class distributions were not normal (this can be checked using the method in
[35]), and they did not have low intrinsic dimensions. As a result, other parametric as well
as nonparametric classifiers had relatively higher misclassification rates.

Recall that in the satimage data set, results are based on a single training and a single
test set. So, the standard errors of the misclassification rates were high for all classifiers,
and this makes it difficult to compare the performance of different classifiers. In this data
set, the k-NN classifier led to the lowest misclassification rate, but SPD and LSPD classifiers
performed better than all other classifiers. Nonlinear SVM, CART and RF had quite high
misclassification rates.


We further analyzed some data sets where the sample size was quite small compared to
the data dimension. In these data sets, we worked with unstandardized observations. Instead
of using the estimated pooled dispersion matrix, we used the identity matrix for the
implementation of LDA. The growth data set contains growth curves of males and females, which
are smooth and monotonically increasing functions. Because of high dependence among the
measurement variables, the class distributions had low intrinsic dimensions, and they were
non-elliptic. As a result, the nonparametric classifiers performed well. SVM with the RBF
kernel had the best misclassification rate, but those of k-NN, KDE and LSPD classifiers were
also comparable. Good performance of the linear SVM classifier indicates that there was
good linear separability between the two classes, but LDA failed to exploit it.

The lightning 2 data set consists of observations that are realizations of time series. In
this data set, RF had the best performance followed by the LSPD classifier. Here also, we
observed non-elliptic class distributions with low intrinsic dimensions [23]. This justifies the
good performance of the classifiers based on k-NN and KDE. The SPD classifier also had
competitive misclassification rates because of the flexibility of GAM. In fact, it yielded the
third best performance in this data set.

Finally, we analyzed the colon data set, which contains micro-array expressions of 2000
genes for some ‘normal’ and ‘colon cancer’ tissues. In this data set, there was good linear
separability among the observations from the two classes. So, LDA and linear SVM had
lower misclassification rates than all other classifiers. Among the nonparametric classifiers,
RF had the best performance closely followed by the SPD classifier. These two classifiers
were less affected by the curse of dimensionality. Recall that $\mathrm{LSPD}_h$ with large bandwidth h
approximates SPD. Because of this automatic adjustment, the LSPD classifier could nearly
match the performance of the SPD classifier.

To compare the overall performance of different classifiers, following the idea of Pi,
we computed their efficiency scores on different data sets. For a data set, if T classifiers
have misclassification rates $e_1, \ldots, e_T$, the efficiency of the t-th classifier ($e_t^*$) is defined as
$e_t^* = e_0/e_t$, where $e_0 = \min_{1 \le t \le T} e_t$. Clearly, in any data set, the best classifier has $e_t^* = 1$, while
a lower value of $e_t^*$ indicates the lack of efficiency of the t-th classifier. In each of these
benchmark data sets, we computed this ratio for all classifiers, and they are graphically
represented by box plots in Figure 7. This figure clearly shows the superiority of the LSPD
classifier (with the highest median value of 0.88) over its competitors. We did not consider
QDA for comparison because it could not be used for some of the data sets.

Figure 7: Overall efficiencies of different classifiers. [Box plots; horizontal axis: classification
methods LDA, SVM-LIN, SVM-RBF, KNN, KNN-MS, KDE, KDE-MS, CART, RF, SPD, LSPD-MS.]
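
The efficiency score defined above is easy to compute for one data set. The sketch below uses the biomedical column of Table 2 (QDA omitted, as in the text); the function name `efficiency_scores` is ours, and the handling of a zero best rate is an assumption we add so that the ratio is always defined.

```python
def efficiency_scores(rates):
    """Map each classifier to e_0 / e_t, where e_0 is the smallest rate in `rates`."""
    best = min(rates.values())
    scores = {}
    for name, rate in rates.items():
        # A classifier attaining the best rate gets efficiency 1 (also when best == 0).
        scores[name] = 1.0 if rate == best else best / rate
    return scores

# Misclassification rates (in %) for the biomedical data, taken from Table 2.
biomed = {
    "LDA": 15.66, "SVM-LIN": 22.03, "SVM-RBF": 12.76, "KNN": 17.74, "KDE": 16.67,
    "CART": 17.69, "RF": 13.23, "SPD": 12.53, "LSPD": 12.49,
}
scores = efficiency_scores(biomed)
```

Here the LSPD classifier has efficiency 1 and every other classifier has a score strictly between 0 and 1. Repeating this for every data set and collecting the scores per classifier gives the kind of summary that the box plots in Figure 7 display.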

8 Concluding remarks 

In this article, we develop and study classifiers constructed by fitting a nonparametric
additive logistic regression model to features extracted from the data using SPD as well as its
localized version, LSPD. The SPD classifier can be viewed as a generalization of parametric
classifiers like LDA and QDA. When the underlying class distributions are elliptic, it has
Bayes optimality. For large values of h, $\mathrm{LSPD}_h$ behaves like SPD, while for small values of
h, it captures the underlying density. So, the multiscale classifier based on LSPD is flexible,
and it overcomes several drawbacks associated with SPD and other existing depth-based
classifiers. When the underlying class distributions are elliptic but not normal, both SPD
and LSPD classifiers outperform popular parametric classifiers like LDA and QDA as well as
nonparametric classifiers. In the case of non-elliptic or multi-modal distributions, while SPD
may fail to extract meaningful discriminatory features, the LSPD classifier can compete with
other nonparametric methods. Moreover, for high-dimensional data, while traditional
nonparametric methods suffer from the curse of dimensionality, both SPD and LSPD classifiers
can lead to low misclassification probabilities. Analyzing several simulated and benchmark
data sets, we have amply demonstrated this. In high-dimensional benchmark data sets, the
class distributions had low intrinsic dimensions due to high correlation among the
measurement variables. Moreover, the competing classes differed mainly in their locations.
As a consequence, though the proposed LSPD classifier had the best overall performance in
benchmark data sets, its superiority over other nonparametric methods was not as prominent
as it was in the simulated examples.


Appendix : Proofs and Mathematical Details 


Lemma 1 : If F has a spherically symmetric density $f(\mathbf{x}) = g(\|\mathbf{x}\|)$ on $\mathbb{R}^d$ with d > 1, then
$\|E_F[u(\mathbf{x} - \mathbf{X})]\|$ is a non-negative monotonically increasing function of $\|\mathbf{x}\|$.


Proof of Lemma 1 : In view of spherical symmetry of $f(\mathbf{x})$, $S(\mathbf{x}) = \|E_F[u(\mathbf{x} - \mathbf{X})]\|$ is
invariant under orthogonal transformations of $\mathbf{x}$. Consequently, $S(\mathbf{x}) = \eta(\|\mathbf{x}\|)$ for some
non-negative function $\eta$. Consider now $\mathbf{x}_1$ and $\mathbf{x}_2$ such that $\|\mathbf{x}_1\| < \|\mathbf{x}_2\|$. Using spherical
symmetry of $f(\mathbf{x})$, without loss of generality, we can assume $\mathbf{x}_i = (t_i, 0, \ldots, 0)^T$ for i = 1, 2
such that $|t_1| < |t_2|$. For any $\mathbf{x} = (t, 0, \ldots, 0)^T$ and $\mathbf{X} = (X_1, \ldots, X_d)^T$, we have

$$S(\mathbf{x}) = \left| E_F\left[ \frac{t - X_1}{\sqrt{(t - X_1)^2 + X_2^2 + \cdots + X_d^2}} \right] \right|$$

due to spherical symmetry of $f(\mathbf{x})$. Note also that for any $\mathbf{x} \in \mathbb{R}^d$ with d > 1, $E_F[\|\mathbf{x} - \mathbf{X}\|]$
is a strictly convex function of $\mathbf{x}$ in this case. Consequently, it is a strictly convex function
of t. Observe now that $S(\mathbf{x})$ with this choice of $\mathbf{x}$ is the absolute value of the derivative of
$E_F[\|\mathbf{x} - \mathbf{X}\|]$ w.r.t. t. The absolute value of this derivative is a symmetric function of t that vanishes at t = 0.
Hence, $S(\mathbf{x})$ is an increasing function of $|t|$, and this proves that $\eta(\|\mathbf{x}_1\|) < \eta(\|\mathbf{x}_2\|)$. □

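Proof of Theorem 1 : If the population distribution $f_j(\mathbf{x})$ ($1 \le j \le J$) is elliptically
symmetric, we have $f_j(\mathbf{x}) = g_j(\delta(\mathbf{x}, F_j))$, where $\delta(\mathbf{x}, F_j) = \{(\mathbf{x} - \boldsymbol{\mu}_j)^T \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)\}^{1/2}$
$= \|\Sigma_j^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_j)\|$. Since $\mathrm{SPD}(\mathbf{x}, F_j) = 1 - \|E\{u(\Sigma_j^{-1/2}(\mathbf{x} - \mathbf{X}))\}\|$ is affine invariant,
it is a function of $\delta(\mathbf{x}, F_j)$, the Mahalanobis distance. Again, since $\Sigma_j^{-1/2}(\mathbf{X} - \boldsymbol{\mu}_j)$ has a
spherically symmetric distribution with its center at the origin, from Lemma 1 it follows that
$\mathrm{SPD}(\mathbf{x}, F_j)$ is a monotonically decreasing function of $\delta(\mathbf{x}, F_j)$. So, $\delta(\mathbf{x}, F_j)$ is also a function
of $\mathrm{SPD}(\mathbf{x}, F_j)$. Therefore, $f_j(\mathbf{x})$, which is a function of $\delta(\mathbf{x}, F_j)$, can also be expressed as

$$f_j(\mathbf{x}) = \psi_j(\mathrm{SPD}(\mathbf{x}, F_j)) \quad \text{for all } 1 \le j \le J,$$

where $\psi_j$ is an appropriate real-valued function that depends on $g_j$. Now, one can check that

$$\log\{p(j|\mathbf{x})/p(J|\mathbf{x})\} = \log(\pi_j/\pi_J) + \log \psi_j(\mathrm{SPD}(\mathbf{x}, F_j)) - \log \psi_J(\mathrm{SPD}(\mathbf{x}, F_J))$$

for $1 \le j \le (J - 1)$. Now, for $1 \le j \ne i \le (J - 1)$, define $\varphi_{jj}(z) = \log \pi_j + \log \psi_j(z)$ and
$\varphi_{ij}(z) = 0$. So, if we define $\varphi_{J1}(z) = \cdots = \varphi_{J(J-1)}(z) = -\log \pi_J - \log \psi_J(z)$, the proof of
the theorem is complete. □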

Remark 1 : If $f_j(\mathbf{x})$ is unimodal, $\psi_j(z)$ is monotonically increasing for $1 \le j \le J$. Moreover,
if the distributions differ only in their locations, the $\psi_j(z)$s are the same for all classes. In that
case, $f_j(\mathbf{x}) > f_i(\mathbf{x}) \Leftrightarrow \mathrm{SPD}(\mathbf{x}, F_j) > \mathrm{SPD}(\mathbf{x}, F_i)$ for $1 \le i \ne j \le J$, and hence the classifier turns
out to be the maximum depth classifier.
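
To make the maximum depth rule concrete, here is a minimal Python sketch of the empirical spatial depth, $1 - \|n^{-1}\sum_i u(\mathbf{x} - \mathbf{X}_i)\|$, and of the classifier that assigns an observation to the class in which it is deepest. It assumes unstandardized data (identity scatter matrix) and the convention $u(\mathbf{0}) = \mathbf{0}$; the function names and the toy "ring" samples are hypothetical and only serve as an illustration.

```python
import math

def unit_vector(v):
    """Return v / ||v||, with the convention u(0) = 0."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [0.0] * len(v)
    return [c / norm for c in v]

def spatial_depth(x, sample):
    """Empirical spatial depth of x with respect to a list of observations."""
    mean_direction = [0.0] * len(x)
    for xi in sample:
        u = unit_vector([a - b for a, b in zip(x, xi)])
        for k, component in enumerate(u):
            mean_direction[k] += component / len(sample)
    return 1.0 - math.sqrt(sum(c * c for c in mean_direction))

def max_depth_classify(x, class_samples):
    """Assign x to the class label with the largest empirical spatial depth."""
    depths = {label: spatial_depth(x, s) for label, s in class_samples.items()}
    return max(depths, key=depths.get)

def ring(center, k=8):
    """k points equally spaced on the unit circle around `center` (toy data)."""
    return [[center[0] + math.cos(2 * math.pi * j / k),
             center[1] + math.sin(2 * math.pi * j / k)] for j in range(k)]

samples = {"a": ring([0.0, 0.0]), "b": ring([5.0, 0.0])}
label = max_depth_classify([0.2, 0.1], samples)
```

For the symmetric toy sample, the depth equals 1 at the center and decreases as the point moves away from it, which mirrors the monotonicity stated in Lemma 1.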

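Proof of Theorem 2 (a) : Let $h \le 1$. For any fixed $\mathbf{x} \in \mathbb{R}^d$ and the distribution function
$F_j$, we have $\mathrm{LSPD}_h(\mathbf{x}, F_j) = E_{F_j}[K_h(\mathbf{t})] - \|E_{F_j}[K_h(\mathbf{t})u(\mathbf{t})]\|$, where $\mathbf{t} = \Sigma_j^{-1/2}(\mathbf{x} - \mathbf{X})$ for
$1 \le j \le J$. For the first term in the expression of $\mathrm{LSPD}_h(\mathbf{x}, F_j)$ above, we have

$$E_{F_j}[K_h(\mathbf{t})] = \int_{\mathbb{R}^d} h^{-d} K\big(h^{-1}\Sigma_j^{-1/2}(\mathbf{x} - \mathbf{v})\big) f_j(\mathbf{v})\, d\mathbf{v} = |\Sigma_j|^{1/2} \int_{\mathbb{R}^d} K(\mathbf{y}) f_j(\mathbf{x} - h\Sigma_j^{1/2}\mathbf{y})\, d\mathbf{y},$$

where $\mathbf{y} = h^{-1}\Sigma_j^{-1/2}(\mathbf{x} - \mathbf{v})$. So, using Taylor's expansion of $f_j(\mathbf{x})$, we get

$$E_{F_j}[K_h(\mathbf{t})] = |\Sigma_j|^{1/2} f_j(\mathbf{x}) - h|\Sigma_j|^{1/2} \int_{\mathbb{R}^d} K(\mathbf{y}) (\Sigma_j^{1/2}\mathbf{y})^T \nabla f_j(\boldsymbol{\xi})\, d\mathbf{y},$$

where $\boldsymbol{\xi}$ lies on the line joining $\mathbf{x}$ and $(\mathbf{x} - h\Sigma_j^{1/2}\mathbf{y})$. So, using the Cauchy-Schwarz inequality,
one gets $\big|E_{F_j}[K_h(\mathbf{t})] - |\Sigma_j|^{1/2} f_j(\mathbf{x})\big| \le h|\Sigma_j|^{1/2}\lambda_j^{1/2} M_j^{\circ} M_K$, where $M_j^{\circ} = \sup_{\mathbf{x} \in \mathbb{R}^d} \|\nabla f_j(\mathbf{x})\|$,
$M_K = \int \|\mathbf{y}\| K(\mathbf{y})\, d\mathbf{y}$, and $\lambda_j$ is the largest eigenvalue of $\Sigma_j$ for $1 \le j \le J$. This implies
$\big|E_{F_j}[K_h(\mathbf{t})] - |\Sigma_j|^{1/2} f_j(\mathbf{x})\big| \to 0$ as $h \to 0$ for $1 \le j \le J$.

For the second term in the expression of $\mathrm{LSPD}_h(\mathbf{x}, F_j)$, a similar argument yields

$$E_{F_j}[K_h(\mathbf{t})u(\mathbf{t})] = |\Sigma_j|^{1/2} \int_{\mathbb{R}^d} K(\mathbf{y}) u(\mathbf{y}) f_j(\mathbf{x} - h\Sigma_j^{1/2}\mathbf{y})\, d\mathbf{y}$$
$$= -h|\Sigma_j|^{1/2} \int_{\mathbb{R}^d} K(\mathbf{y}) u(\mathbf{y}) (\Sigma_j^{1/2}\mathbf{y})^T \nabla f_j(\boldsymbol{\xi})\, d\mathbf{y} \quad \Big(\text{since } \int K(\mathbf{y})u(\mathbf{y})\, d\mathbf{y} = \mathbf{0}\Big).$$

So, $\|E_{F_j}[K_h(\mathbf{t})u(\mathbf{t})]\| \le h|\Sigma_j|^{1/2}\lambda_j^{1/2} M_j^{\circ} M_K \to 0$ as $h \to 0$, and hence, $\mathrm{LSPD}_h(\mathbf{x}, F_j) \to$
$|\Sigma_j|^{1/2} f_j(\mathbf{x})$ as $h \to 0$. Consequently, $\mathbf{z}_h(\mathbf{x}) \to (|\Sigma_1|^{1/2} f_1(\mathbf{x}), \ldots, |\Sigma_J|^{1/2} f_J(\mathbf{x}))^T$ as $h \to 0$. □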


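Proof of Theorem 2 (b) : Here we consider the case h > 1. Consider any fixed $\mathbf{x} \in \mathbb{R}^d$
and any fixed j ($1 \le j \le J$). For any fixed $\mathbf{t}$, since $K(\mathbf{t}/h) \to K(\mathbf{0})$ as $h \to \infty$, using the
Dominated Convergence Theorem (note that K is bounded), one can show that

$$\mathrm{LSPD}_h(\mathbf{x}, F_j) \to K(\mathbf{0})\,\mathrm{SPD}(\mathbf{x}, F_j) \quad \text{as } h \to \infty.$$

So, $\mathbf{z}_h(\mathbf{x}) \to (K(\mathbf{0})\mathrm{SPD}(\mathbf{x}, F_1), \ldots, K(\mathbf{0})\mathrm{SPD}(\mathbf{x}, F_J))^T$ as $h \to \infty$. □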
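
The large-bandwidth limit in Theorem 2 (b) can be checked numerically on an empirical version of LSPD. The sketch below is an illustration under assumptions that are ours: a Gaussian kernel $K(\mathbf{t}) = (2\pi)^{-d/2}\exp(-\|\mathbf{t}\|^2/2)$, unstandardized data (identity scatter matrix), the normalization $1/(nh^d)$ for $h \le 1$ and $1/n$ for $h > 1$, and hypothetical function names.

```python
import math

def gaussian_kernel(t):
    """Standard Gaussian kernel on R^d (an assumed choice of K)."""
    d = len(t)
    return (2.0 * math.pi) ** (-d / 2.0) * math.exp(-0.5 * sum(c * c for c in t))

def spatial_depth(x, sample):
    """Empirical SPD: 1 - || average of unit vectors u(x - X_i) ||."""
    total = [0.0] * len(x)
    for xi in sample:
        diff = [a - b for a, b in zip(x, xi)]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm > 0.0:
            total = [s + c / norm for s, c in zip(total, diff)]
    return 1.0 - math.sqrt(sum(c * c for c in total)) / len(sample)

def lspd(x, sample, h):
    """Empirical localized spatial depth with bandwidth h."""
    d, n = len(x), len(sample)
    weight_sum = 0.0
    weighted_direction = [0.0] * d
    for xi in sample:
        diff = [a - b for a, b in zip(x, xi)]
        w = gaussian_kernel([c / h for c in diff])
        weight_sum += w
        norm = math.sqrt(sum(c * c for c in diff))
        if norm > 0.0:
            for k in range(d):
                weighted_direction[k] += w * diff[k] / norm
    # The h^d factor appears only for h <= 1.
    scale = n * h ** d if h <= 1.0 else n
    return (weight_sum - math.sqrt(sum(c * c for c in weighted_direction))) / scale

sample = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
x = [0.3, 0.4]
large_h_value = lspd(x, sample, 1e6)
limit_value = gaussian_kernel([0.0, 0.0]) * spatial_depth(x, sample)
```

For a very large h, all kernel weights are essentially $K(\mathbf{0})$, so `large_h_value` and `limit_value` agree up to rounding error; for small h, the first term behaves like a kernel density estimate, as in part (a).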


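Proof of Theorem 3 : Define the sets $B_n = \{\mathbf{x} = (x_1, \ldots, x_d) : \|\mathbf{x}\| \le \sqrt{d}\,n\}$, and
$A_n = \{\mathbf{x} : n^2 x_i \text{ is an integer and } |x_i| \le n \text{ for all } 1 \le i \le d\}$. Clearly $A_n \subset B_n \subset \mathbb{R}^d$, the set
$B_n$ is a closed ball and the set $A_n$ has cardinality $(2n^3 + 1)^d$. We will prove the almost sure
(a.s.) uniform convergence on the three sets: (i) on $A_n$, (ii) on $B_n \setminus A_n$, and (iii) on $B_n^c$.

Consider any fixed $h \in (0, 1]$. Recall that for this choice of h, $\mathrm{LSPD}_h(\mathbf{x}, F)$ (see equation
(3)) and $\mathrm{LSPD}_h(\mathbf{x}, F_n)$ are defined as follows:

$$\mathrm{LSPD}_h(\mathbf{x}, F_n) = \frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i)) - \frac{1}{nh^d}\Big\|\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i))\,u(\mathbf{x} - \mathbf{X}_i)\Big\|,$$

and $\mathrm{LSPD}_h(\mathbf{x}, F) = h^{-d}E[K(h^{-1}(\mathbf{x} - \mathbf{X}))] - h^{-d}\|E[K(h^{-1}(\mathbf{x} - \mathbf{X}))\,u(\mathbf{x} - \mathbf{X})]\|$.

(i) Define $\mathbf{Z}_i = K(h^{-1}(\mathbf{x} - \mathbf{X}_i))u(\mathbf{x} - \mathbf{X}_i) - E[K(h^{-1}(\mathbf{x} - \mathbf{X}))u(\mathbf{x} - \mathbf{X})]$ for $1 \le i \le n$.
Note that the $\mathbf{Z}_i$s are independent and identically distributed (i.i.d.) with $E(\mathbf{Z}_i) = \mathbf{0}$ and
$\|\mathbf{Z}_i\| \le 2K(\mathbf{0})$. Using the exponential inequality for sums of i.i.d. random vectors (see p.
491 of [H]), for any fixed $\epsilon > 0$, we get $P(\|\frac{1}{n}\sum_{i=1}^n \mathbf{Z}_i\| > \epsilon) \le 2e^{-C_0 n\epsilon^2}$, where $C_0$ is a
positive constant that depends on $K(\mathbf{0})$ and $\epsilon$. This now implies that

$$P\bigg(\bigg|\,\Big\|\frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i))u(\mathbf{x} - \mathbf{X}_i)\Big\| - h^{-d}\big\|E[K(h^{-1}(\mathbf{x} - \mathbf{X}))u(\mathbf{x} - \mathbf{X})]\big\|\,\bigg| > \epsilon\bigg)$$
$$\le P\bigg(\Big\|\frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i))u(\mathbf{x} - \mathbf{X}_i) - h^{-d}E[K(h^{-1}(\mathbf{x} - \mathbf{X}))u(\mathbf{x} - \mathbf{X})]\Big\| > \epsilon\bigg)$$
$$= P\bigg(\Big\|\frac{1}{n}\sum_{i=1}^n \mathbf{Z}_i\Big\| > h^d\epsilon\bigg) \le 2e^{-C_0 n h^{2d}\epsilon^2}. \qquad (6)$$

For a fixed value of h, since $\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i))$ is a sum of i.i.d. bounded random
variables, using Bernstein's inequality, we also have

$$P\bigg(\Big|\frac{1}{n}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i)) - E[K(h^{-1}(\mathbf{x} - \mathbf{X}))]\Big| > \epsilon\bigg) \le 2e^{-C_1 n\epsilon^2}$$

for some suitable positive constant $C_1$. This implies

$$P\bigg(\Big|\frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i)) - h^{-d}E[K(h^{-1}(\mathbf{x} - \mathbf{X}))]\Big| > \epsilon\bigg) \le 2e^{-C_1 n h^{2d}\epsilon^2}. \qquad (7)$$

Combining equations (6) and (7), we get $P(|\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| > \epsilon) \le C_3 e^{-C_4 n h^{2d}\epsilon^2}$
for some suitable constants $C_3$ and $C_4$. Since the cardinality of $A_n$ is $(2n^3 + 1)^d$, we have

$$P\Big(\sup_{\mathbf{x} \in A_n} |\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| > \epsilon\Big) \le C_3 (2n^3 + 1)^d e^{-C_4 n h^{2d}\epsilon^2}. \qquad (8)$$

Now, $\sum_{n \ge 1} C_3 (2n^3 + 1)^d e^{-C_4 n h^{2d}\epsilon^2} < \infty$. So, a simple application of the Borel-Cantelli lemma implies
that $\sup_{\mathbf{x} \in A_n} |\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| \to 0$ a.s. as $n \to \infty$.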


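(ii) Consider the set $B_n \setminus A_n$. Note that given any $\mathbf{x}$ in $B_n \setminus A_n$, there exists $\mathbf{y} \in A_n$
such that $\|\mathbf{x} - \mathbf{y}\| \le \sqrt{2}/n^2$. First we will show that $|\mathrm{LSPD}_h(\mathbf{y}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F_n)| \to 0$ a.s. as
$n \to \infty$. Using the mean value theorem, one gets

$$\bigg|\frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i)) - \frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{y} - \mathbf{X}_i))\bigg| = \bigg|\frac{1}{nh^{d+1}}\sum_{i=1}^n (\mathbf{x} - \mathbf{y})^T \nabla K(h^{-1}(\boldsymbol{\xi} - \mathbf{X}_i))\bigg|,$$

where $\boldsymbol{\xi}$ lies on the line joining $\mathbf{x}$ and $\mathbf{y}$. Note that the right hand side is less than $\sqrt{2}M_K'/(n^2 h^{d+1})$,
where $M_K' = \sup_{\mathbf{t}} \|\nabla K(\mathbf{t})\|$. This upper bound is free of $\mathbf{x}$, and goes to 0 as $n \to \infty$. Now,

$$\bigg|\frac{1}{nh^d}\Big\|\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i))u(\mathbf{x} - \mathbf{X}_i)\Big\| - \frac{1}{nh^d}\Big\|\sum_{i=1}^n K(h^{-1}(\mathbf{y} - \mathbf{X}_i))u(\mathbf{y} - \mathbf{X}_i)\Big\|\bigg|$$
$$\le \frac{1}{nh^d}\Big\|\sum_{i=1}^n \big[K(h^{-1}(\mathbf{x} - \mathbf{X}_i))u(\mathbf{x} - \mathbf{X}_i) - K(h^{-1}(\mathbf{y} - \mathbf{X}_i))u(\mathbf{y} - \mathbf{X}_i)\big]\Big\|$$
$$\le \frac{1}{nh^d}\sum_{i=1}^n \big|K(h^{-1}(\mathbf{x} - \mathbf{X}_i)) - K(h^{-1}(\mathbf{y} - \mathbf{X}_i))\big| + \frac{K(\mathbf{0})}{nh^d}\sum_{i=1}^n \big\|u(\mathbf{x} - \mathbf{X}_i) - u(\mathbf{y} - \mathbf{X}_i)\big\|.$$

We have already proved that the first part converges to 0 in the a.s. sense. For the second
part, consider a ball of radius 1/n around $\mathbf{x}$ (say, $B(\mathbf{x}, 1/n)$). Now,

$$\frac{K(\mathbf{0})}{nh^d}\sum_{i=1}^n \big\|u(\mathbf{x} - \mathbf{X}_i) - u(\mathbf{y} - \mathbf{X}_i)\big\| \le \frac{2K(\mathbf{0})}{nh^d}\sum_{i=1}^n I\{\mathbf{X}_i \in B(\mathbf{x}, 1/n)\} + \frac{2nK(\mathbf{0})}{h^d}\|\mathbf{x} - \mathbf{y}\|$$
$$\le \frac{2K(\mathbf{0})}{h^d}\bigg|\frac{1}{n}\sum_{i=1}^n I\{\mathbf{X}_i \in B(\mathbf{x}, 1/n)\} - P\{\mathbf{X}_1 \in B(\mathbf{x}, 1/n)\}\bigg| + \frac{2K(\mathbf{0})}{h^d}P\{\mathbf{X}_1 \in B(\mathbf{x}, 1/n)\} + \frac{2nK(\mathbf{0})\sqrt{2}}{n^2 h^d}.$$

Note that $I\{\mathbf{X}_i \in B(\mathbf{x}, 1/n)\}$ are i.i.d. bounded random variables with expectation
$P\{\mathbf{X}_1 \in B(\mathbf{x}, 1/n)\}$. So, the a.s. convergence of the first term follows from Bernstein's
inequality. Since $P\{\mathbf{X}_1 \in B(\mathbf{x}, 1/n)\} \le M_f/n^d$ (where $M_f = \sup_{\mathbf{x}} f(\mathbf{x}) < \infty$), the second
term converges to 0. The third term also converges to 0 as $n \to \infty$. Therefore, we have
$|\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{y}, F_n)| \to 0$ a.s. as $n \to \infty$.

Similarly, one can prove that $|\mathrm{LSPD}_h(\mathbf{x}, F) - \mathrm{LSPD}_h(\mathbf{y}, F)| \to 0$ as $n \to \infty$. Note that
in the arguments above, all bounds are free from $\mathbf{x}$ and $\mathbf{y}$. We have also proved that
$\sup_{\mathbf{y} \in A_n} |\mathrm{LSPD}_h(\mathbf{y}, F_n) - \mathrm{LSPD}_h(\mathbf{y}, F)| \to 0$ a.s. as $n \to \infty$. So, combining these results, we have
$\sup_{\mathbf{x} \in B_n \setminus A_n} |\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| \to 0$ a.s. as $n \to \infty$.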

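(iii) Now, consider the region outside $B_n$ (i.e., $B_n^c$). First note that

$$\sup_{\mathbf{x} \in B_n^c} |\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| \le \sup_{\mathbf{x} \in B_n^c} \frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i)) + \sup_{\mathbf{x} \in B_n^c} h^{-d}E[K(h^{-1}(\mathbf{x} - \mathbf{X}))].$$

We will show that both of these terms become sufficiently small as $n \to \infty$.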


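Fix any $\epsilon > 0$. We can choose two constants $M_1$ and $M_2$ such that $P(\|\mathbf{X}\| > M_1) \le$
$h^d\epsilon/2K(\mathbf{0})$ and $K(\mathbf{t}) \le h^d\epsilon/2$ when $\|\mathbf{t}\| \ge M_2$. Now, one can check that

$$h^{-d}E[K(h^{-1}(\mathbf{x} - \mathbf{X}))] \le h^{-d}E[K(h^{-1}(\mathbf{x} - \mathbf{X}))I(\|\mathbf{X}\| \le M_1)] + h^{-d}K(\mathbf{0})P(\|\mathbf{X}\| > M_1).$$

Note that if $\mathbf{x} \in B_n^c$ and $\|\mathbf{X}\| \le M_1$, $h^{-1}\|\mathbf{x} - \mathbf{X}\| \ge h^{-1}|\sqrt{d}\,n - M_1|$. Now, choose n large
enough such that $|\sqrt{d}\,n - M_1| \ge M_2 h$, and this implies $K(h^{-1}(\mathbf{x} - \mathbf{X})) \le h^d\epsilon/2$. So, we get

$$h^{-d}E[K(h^{-1}(\mathbf{x} - \mathbf{X}))] \le \epsilon/2 + h^{-d}K(\mathbf{0})P(\|\mathbf{X}\| > M_1) \le \epsilon, \quad \text{and}$$

$$\frac{1}{nh^d}\sum_{i=1}^n K(h^{-1}(\mathbf{x} - \mathbf{X}_i)) \le \epsilon/2 + h^{-d}K(\mathbf{0})\,\frac{1}{n}\sum_{i=1}^n I(\|\mathbf{X}_i\| > M_1)$$
$$\le \epsilon + h^{-d}K(\mathbf{0})\,\bigg|\frac{1}{n}\sum_{i=1}^n I(\|\mathbf{X}_i\| > M_1) - P(\|\mathbf{X}\| > M_1)\bigg|.$$

The Glivenko-Cantelli theorem implies that the last term on the right hand side converges
to 0 a.s. as $n \to \infty$. So, we have $\sup_{\mathbf{x} \in B_n^c} |\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| \to 0$ a.s. as $n \to \infty$.

Combining (i), (ii) and (iii), we now have $\sup_{\mathbf{x}} |\mathrm{LSPD}_h(\mathbf{x}, F_n) - \mathrm{LSPD}_h(\mathbf{x}, F)| \to 0$ a.s. for
any $h \in (0, 1]$.

For any fixed h > 1, this a.s. convergence can be proved in a similar way. In that case,
recall that the definition of LSPD does not involve the $h^d$ term in the denominator. □

Remark 2: Following the proof of Theorem 3, it is easy to check that the a.s. convergence
holds when h diverges to infinity at any rate with n.

Remark 3: The result continues to hold when $h \to 0$ as well. However, for the a.s.
convergence in part (i) (more specifically, to use the Borel-Cantelli lemma), we require
$nh^{2d}/\log n \to \infty$ as $n \to \infty$. In part (iii), we need $M_1$ and $M_2$ to vary with n. Assume the
first moment of f to be finite, and $\int \|\mathbf{t}\|K(\mathbf{t})\,d\mathbf{t} < \infty$ (which implies $\|\mathbf{t}\|K(\mathbf{t}) \to 0$ as $\|\mathbf{t}\| \to \infty$).
Also assume that $nh^{2d}/\log n \to \infty$ as $n \to \infty$. We can now choose a common value $M_1 = M_2$,
varying with n, to ensure that both $P(\|\mathbf{X}\| > M_1) \le h^d\epsilon/2K(\mathbf{0})$ and $K(\mathbf{t}) \le h^d\epsilon/2$ for $\|\mathbf{t}\| \ge M_2$ hold for
sufficiently large n.

Proof of Theorem 4 (a) : Consider two independent random vectors $\mathbf{X} = (X^{(1)}, \ldots, X^{(d)})^T \sim F_j$
and $\mathbf{X}_1 = (X_1^{(1)}, \ldots, X_1^{(d)})^T \sim F_j$, where $1 \le j \le J$. It follows from (C1) and
(C2) that $\|\mathbf{X} - \mathbf{X}_1\|/\sqrt{d} \to \sigma_j\sqrt{2}$ a.s. as $d \to \infty$. So, for almost every realization $\mathbf{x}$ of $\mathbf{X} \sim F_j$,

$$\|\mathbf{x} - \mathbf{X}_1\|/\sqrt{d} \to \sigma_j\sqrt{2} \quad \text{a.s. as } d \to \infty. \qquad (9)$$

Next, consider two independent random vectors $\mathbf{X} \sim F_j$ and $\mathbf{X}_1 \sim F_i$ for $1 \le i \ne j \le J$.
Using (C1) and (C2), we get $\|\mathbf{X} - \mathbf{X}_1\|/\sqrt{d} \to \sqrt{\sigma_j^2 + \sigma_i^2 + \nu_{ji}}$ a.s. as $d \to \infty$. Consequently, for
almost every realization $\mathbf{x}$ of $\mathbf{X} \sim F_j$,

$$\|\mathbf{x} - \mathbf{X}_1\|/\sqrt{d} \to \sqrt{\sigma_j^2 + \sigma_i^2 + \nu_{ji}} \quad \text{a.s. as } d \to \infty. \qquad (10)$$

Let us next consider $\langle \mathbf{x} - \mathbf{X}_1, \mathbf{x} - \mathbf{X}_2 \rangle$, where $\mathbf{X} \sim F_j$, $\mathbf{X}_1, \mathbf{X}_2 \sim F_i$ are independent
random vectors, and $\langle \cdot, \cdot \rangle$ denotes the inner product in $\mathbb{R}^d$. Therefore, for almost every
realization $\mathbf{x}$ of $\mathbf{X}$, arguments similar to those used in (9) and (10) yield

$$\frac{\langle \mathbf{x} - \mathbf{X}_1, \mathbf{x} - \mathbf{X}_2 \rangle}{d} \to \sigma_j^2 \quad \text{a.s. as } d \to \infty \text{ if } 1 \le i = j \le J, \text{ and} \qquad (11)$$

$$\frac{\langle \mathbf{x} - \mathbf{X}_1, \mathbf{x} - \mathbf{X}_2 \rangle}{d} \to \sigma_j^2 + \nu_{ji} \quad \text{a.s. as } d \to \infty \text{ if } 1 \le i \ne j \le J. \qquad (12)$$

Observe now that $\|E_{F_i}[u(\mathbf{x} - \mathbf{X})]\|^2 = \langle E_{F_i}[u(\mathbf{x} - \mathbf{X}_1)], E_{F_i}[u(\mathbf{x} - \mathbf{X}_2)] \rangle = E_{F_i}\{\langle u(\mathbf{x} - \mathbf{X}_1), u(\mathbf{x} - \mathbf{X}_2) \rangle\}$,
where $\mathbf{X}_1, \mathbf{X}_2 \sim F_i$ are independent random vectors for $1 \le i \le J$.

Since here we are dealing with expectations of random vectors with bounded norm, a
simple application of the Dominated Convergence Theorem implies that for almost every realization
$\mathbf{x}$ of $\mathbf{X} \sim F_j$ ($1 \le j \le J$), as $d \to \infty$,

$$\mathrm{SPD}(\mathbf{x}, F_j) \to 1 - \sqrt{\frac{1}{2}} \quad \text{and} \quad \mathrm{SPD}(\mathbf{x}, F_i) \to 1 - \sqrt{\frac{\sigma_j^2 + \nu_{ji}}{\sigma_j^2 + \sigma_i^2 + \nu_{ji}}} \quad \text{for } i \ne j. \qquad (13)$$

Therefore, for $\mathbf{X} \sim F_j$, we get $\mathbf{z}(\mathbf{X}) = (\mathrm{SPD}(\mathbf{X}, F_1), \ldots, \mathrm{SPD}(\mathbf{X}, F_J))^T \to \mathbf{c}_j$ a.s. as $d \to \infty$. □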
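
The distance concentration behind this proof can be seen in a quick simulation. The Python sketch below assumes, purely for illustration, that all three vectors are drawn from $N(\mathbf{0}, \sigma^2 I_d)$ (one simple distribution with per-coordinate variance $\sigma^2$); the function names are hypothetical. It checks that $\|\mathbf{x} - \mathbf{X}_1\|/\sqrt{d}$ is close to $\sigma\sqrt{2}$ and that the cosine of the angle between $\mathbf{x} - \mathbf{X}_1$ and $\mathbf{x} - \mathbf{X}_2$ is close to 1/2, and it evaluates the limiting depth obtained from the ratio of the inner-product limit to the squared-distance limit.

```python
import math
import random

def scaled_distance(x, y):
    """||x - y|| / sqrt(d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def simulate(d, sigma, seed):
    """Draw x, X1, X2 i.i.d. N(0, sigma^2 I_d); return scaled distance and cosine."""
    rng = random.Random(seed)
    x, x1, x2 = ([rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(3))
    diff_1 = [a - b for a, b in zip(x, x1)]
    diff_2 = [a - b for a, b in zip(x, x2)]
    return scaled_distance(x, x1), cosine(diff_1, diff_2)

def limiting_spd(sigma_j_sq, sigma_i_sq, nu_ji):
    """Limit of SPD(x, F_i) for x from class j; use nu_ji = 0 and equal variances if i = j."""
    return 1.0 - math.sqrt((sigma_j_sq + nu_ji) / (sigma_j_sq + sigma_i_sq + nu_ji))

distance, cos_angle = simulate(d=20000, sigma=1.5, seed=1)
same_class_limit = limiting_spd(2.25, 2.25, 0.0)
```

In this setting, the same-class limit is $1 - 1/\sqrt{2} \approx 0.29$ whatever the value of $\sigma$, while a positive $\nu_{ji}$ or a difference in variances moves the cross-class limit away from it; this separation of the limits is what the classifier exploits in high dimensions.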

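Proof of Theorem 4 (b) : Recall that for h > 1, $\mathrm{LSPD}_h(\mathbf{x}, F) = E_F[h^d K_h(\mathbf{t})] -$
$\|E_F[h^d K_h(\mathbf{t})u(\mathbf{t})]\|$, and since we have assumed the $\mathbf{X}$s to be standardized, here we have $h^d K_h(\mathbf{t}) =$
$K((\mathbf{x} - \mathbf{X})/h) = g(\|\mathbf{x} - \mathbf{X}\|/h)$. Let $\mathbf{X} \sim F_j$ and $\mathbf{X}_1 \sim F_i$ where $1 \le i \le J$. Then, using (9)
and (10) above, and the continuity of g, for almost every realization $\mathbf{x}$ of $\mathbf{X} \sim F_j$, one gets

$$g\left(\frac{\|\mathbf{x} - \mathbf{X}_1\|}{h}\right) = g\left(\frac{\sqrt{d}}{h} \cdot \frac{\|\mathbf{x} - \mathbf{X}_1\|}{\sqrt{d}}\right) \to g(0) \text{ or } g(e_{ji}A) \quad \text{a.s.},$$

depending on whether $\sqrt{d}/h \to 0$ or A. The proof
now follows from a simple application of the Dominated Convergence Theorem. □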


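Proof of Theorem 4 (c) : Since $g(s) \to 0$ as $s \to \infty$, using the same argument as used in
the proof of Theorem 4 (b), for $\mathbf{X}_1 \sim F_i$ and almost every realization $\mathbf{x}$ of $\mathbf{X} \sim F_j$, we have

$$g\left(\frac{\|\mathbf{x} - \mathbf{X}_1\|}{h}\right) = g\left(\frac{\sqrt{d}}{h} \cdot \frac{\|\mathbf{x} - \mathbf{X}_1\|}{\sqrt{d}}\right) \to 0 \quad \text{a.s. as } \sqrt{d}/h \to \infty.$$

The proof now follows from a simple application of the Dominated Convergence Theorem. □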


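Lemma 2 : Recall $\mathbf{c}_j$ and $\mathbf{c}_j'$ for $1 \le j \le J$ defined in Theorem 4 (a) and (b), respectively.
For any $1 \le j \ne i \le J$, $\mathbf{c}_j = \mathbf{c}_i$ if and only if $\sigma_j = \sigma_i$ and $\nu_{ji} = \nu_{ij} = 0$. Similarly, $\mathbf{c}_j' = \mathbf{c}_i'$ if
and only if $\sigma_j = \sigma_i$ and $\nu_{ji} = \nu_{ij} = 0$.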


Proof of Lemma 2 : The ‘if’ part is easy to check in both cases. So, it is enough to
prove the ‘only if’ part and that too for the case of J = 2. Note that if $\mathbf{c}_1 = (c_{11}, c_{12})^T$ and
$\mathbf{c}_2 = (c_{21}, c_{22})^T$ are equal, we have

$$\frac{\sigma_1^2 + \nu_{12}}{\sigma_1^2 + \sigma_2^2 + \nu_{12}} = 1/2 \quad \text{and} \quad \frac{\sigma_2^2 + \nu_{12}}{\sigma_1^2 + \sigma_2^2 + \nu_{12}} = 1/2.$$

These two equations hold simultaneously only if $\sigma_1^2 = \sigma_2^2$ and $\nu_{12} (= \nu_{21}) = 0$.

Now consider the case $\mathbf{c}_1' = \mathbf{c}_2'$. Recall that $c_{11}' = g(A\sqrt{2}\sigma_1)c_{11}$, $c_{22}' = g(A\sqrt{2}\sigma_2)c_{22}$,
$c_{12}' = g(A\sqrt{\sigma_1^2 + \sigma_2^2 + \nu_{12}})c_{12}$ and $c_{21}' = g(A\sqrt{\sigma_1^2 + \sigma_2^2 + \nu_{21}})c_{21}$. If possible, assume that
$\sigma_1 > \sigma_2$. This implies that $A\sqrt{\sigma_1^2 + \sigma_2^2 + \nu_{12}} > A\sqrt{2}\sigma_2$ and hence

$$g(A\sqrt{2}\sigma_2) > g\big(A\sqrt{\sigma_1^2 + \sigma_2^2 + \nu_{12}}\big) \quad \text{(since g is monotonically decreasing)}. \qquad (14)$$

Also, if $\sigma_1 > \sigma_2$, we must have

$$1/2 < \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \le \frac{\sigma_1^2 + \nu_{12}}{\sigma_1^2 + \sigma_2^2 + \nu_{12}} < 1 \;\Rightarrow\; 1 - \sqrt{1/2} > 1 - \sqrt{\frac{\sigma_1^2 + \nu_{12}}{\sigma_1^2 + \sigma_2^2 + \nu_{12}}}. \qquad (15)$$

Combining (14) and (15), we have $c_{22}' > c_{12}'$, and this implies $\mathbf{c}_1' \ne \mathbf{c}_2'$. Similarly, if $\sigma_1 < \sigma_2$,
we get $c_{11}' > c_{21}'$ and hence $\mathbf{c}_1' \ne \mathbf{c}_2'$. Again, if $\sigma_1 = \sigma_2$ but $\nu_{12} = \nu_{21} > 0$, similar arguments
lead to $\mathbf{c}_1' \ne \mathbf{c}_2'$. This completes the proof of the lemma. □